As multi-step reasoning and tool use expand, models effectively become distinct actors in the threat model. We have no idea how many different ways a model's alignment can be influenced by its context (the Anthropic paper on subliminal learning [1] was eye-opening in this regard), and consequently we have no deterministic way to protect it.
1 - https://alignment.anthropic.com/2025/subliminal-learning/
I’d argue they’re only distinct actors in the threat model as far as where they sit (within which perimeters), not in terms of how they behave.
We already have another actor in the threat model that behaves equivalently as far as determinism/threat risk is concerned: human users.
Issue is, a lot of LLM security work assumes LLMs function like programs. They don't. They function like humans, but run where programs run.
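To make that concrete: if the model behaves like a human user, its output gets the same treatment as human input. A minimal sketch in Python, assuming a hypothetical agent that emits JSON tool calls (the tool names, fields, and checks are illustrative, not from any particular framework):

    # Hypothetical sketch: treat a model-produced tool call like form data from
    # a human user -- parse it, allowlist it, and bound-check it before acting on it.
    import json

    ALLOWED_TOOLS = {"search", "read_file"}   # explicit allowlist, like route-level auth

    def validate_tool_call(raw: str) -> dict:
        call = json.loads(raw)                # may raise -- same as malformed user input
        if call.get("tool") not in ALLOWED_TOOLS:
            raise PermissionError(f"tool not allowed: {call.get('tool')!r}")
        args = call.get("args")
        if not isinstance(args, dict):
            raise ValueError("args must be a JSON object")
        path = args.get("path", "")
        if len(path) > 256 or ".." in path:   # same checks you'd run on a user-supplied path
            raise ValueError("suspicious path argument")
        return call

    print(validate_tool_call('{"tool": "read_file", "args": {"path": "notes.txt"}}'))

None of that is novel; it's just the ordinary untrusted-input discipline we already apply to humans, pointed at the model instead.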