As multi-step reasoning and tool use expand, models effectively become distinct actors in the threat model. We have no idea how many different ways a model's alignment can be influenced by its context (the Anthropic paper on subliminal learning [1] was eye-opening in this regard), and consequently we have no deterministic way to protect it.
1 - https://alignment.anthropic.com/2025/subliminal-learning/
I’d argue they’re only distinct actors in the threat model as far as where they sit (within which perimeters), not in terms of how they behave.
We already have another actor in the threat model that behaves equivalently as far as determinism/threat risk is concerned: human users.
Issue is, a lot of LLM security work assumes LLMs function like programs. They don't. They function like humans, but run where programs run.
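To make that concrete: if the model behaves like a human user, its output gets the same treatment as human input. A minimal sketch in Python, assuming a hypothetical agent that emits JSON tool calls (the tool names, fields, and checks are illustrative, not from any particular framework):

    # Hypothetical sketch: treat a model-produced tool call like form data from
    # a human user -- parse it, allowlist it, and bound-check it before acting on it.
    import json

    ALLOWED_TOOLS = {"search", "read_file"}   # explicit allowlist, like route-level auth

    def validate_tool_call(raw: str) -> dict:
        call = json.loads(raw)                # may raise -- same as malformed user input
        if call.get("tool") not in ALLOWED_TOOLS:
            raise PermissionError(f"tool not allowed: {call.get('tool')!r}")
        args = call.get("args")
        if not isinstance(args, dict):
            raise ValueError("args must be a JSON object")
        path = args.get("path", "")
        if len(path) > 256 or ".." in path:   # same checks you'd run on a user-supplied path
            raise ValueError("suspicious path argument")
        return call

    print(validate_tool_call('{"tool": "read_file", "args": {"path": "notes.txt"}}'))

None of that is novel; it's just the ordinary untrusted-input discipline we already apply to humans, pointed at the model instead.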