Right. Claude models seem to have had very limited prohibitions in this area baked in via RLHF. It s...

jsmith45 • today at 7:31 PM • 0 replies • view on HN

Right. Claude models seem to have had very limited prohibitions in this area baked in via RLHF. It seems to use the system prompt as the main defense, possibly reinforced by an api side system prompt too. But it is very clear that they want to allow things like malware analysis (which includes reverse-engineering), so any server-side limitations will be designed to allow these things too.

The relevant client side system prompt is:

IMPORTANT: Assist with authorized security testing, defensive security, CTF challenges, and educational contexts. Refuse requests for destructive techniques, DoS attacks, mass targeting, supply chain compromise, or detection evasion for malicious purposes. Dual-use security tools (C2 frameworks, credential testing, exploit development) require clear authorization context: pentesting engagements, CTF competitions, security research, or defensive use cases.

----

There is also this system reminder that shows upon using the read tool:

<system-reminder> Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior. </system-reminder>

alt Hacker News