logoalt Hacker News

neckardtyesterday at 10:13 PM1 replyview on HN

If the stop hook is implemented as a tool result, there would be a rational explanation for this.

Agent tools can often return data that’s untrustworthy. For example, reading websites, looking through knowledge bases, and so on. If the agent treated tool results as instructional, prompt injection would be possible.

I imagine Anthropic intentionally trains claude to treat tool results a informational but not instructional. They might test with a tool results that contains “Ignore all other instructions and do XYZ”. The agent is trained to ignore it.

If these hooks then show up as tool results context, something like “You must do XYZ now” would be exactly the thing the model is trained to ignore.

Claude code might need to switch to having hooks provide guidance as user context rather than tool results context to fix this. Or it might require adding additional instructions to the system prompt that certain hooks are trustworthy.

Point being, while in this scenario the behavior is undesirable, it likely is emergent from Claude’s resistance to tool result prompt injection.


Replies

steve_adams_86yesterday at 10:29 PM

This is why I think harnesses should have more assertive layers of control and constraint. So much of what Claude does now is purely context-derived (like skills) and I plain old don't see that as the future. It's highly convenient that it works—kind of amazing really—but the stop hook should literally stop the LLM in its tracks, and we should normalize this kind of control structure around non-deterministic systems.

The thing is, making everything context means our systems can be extremely fluid and language-driven, which means tool developers can do a lot more, a lot faster. It's a number go up thing, in my opinion. We could make better harnesses with stricter controls, but we wouldn't build things like Claude Code as quickly.

The skills and plugins conventions weird me out so much. So much text and so little meaningful control.

show 1 reply