Hacker News

New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

94 points · by simonw · last Sunday at 11:11 PM · 37 comments

Comments

mickayz · yesterday at 5:56 PM

Hey folks, one of the authors of the original post here.

First, I want to thank simonw for coming up with the lethal trifecta (the direct inspiration for this work), as well as for all of the great feedback we’ve received from Simon and others! Our goal in publishing this framework was to inspire precisely these kinds of discussions, so our industry can move its understanding of these risks forward.

Regarding the concerns over the Venn diagram labeling certain intersections as “safe”: this is 100% valid, and we’ve updated it to be clearer. The goal of the Rule of Two is not to describe a sufficient level of security for agents, but rather a minimum bar needed to deterministically prevent the highest-impact consequences of prompt injection. The earlier framing of “safe” did not make this clear.

Beyond prompt injection there are other risks that have to be considered, which we briefly describe in the Limitations section of the post. That said, we do see value in having the Rule of Two to frame some of the discussion around which unambiguous constraints exist today, given the unsolved risk of prompt injection.

Looking forward to further discussion!

simonw · yesterday at 10:07 AM

I added this section to my post just now: https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa...

> On thinking about this further there’s one aspect of the Rule of Two model that doesn’t work for me: the Venn diagram above marks the combination of untrustworthy inputs and the ability to change state as “safe”, but that’s not right. Even without access to private systems or sensitive data that pairing can still produce harmful results. Unfortunately adding an exception for that pair undermines the simplicity of the “Rule of Two” framing!

jFriedensreich · yesterday at 12:07 PM

I am confused that this article does not talk about taint tracking. If state was mutated by an agent with untrustworthy input, the taint would transfer to the state, making it untrustworthy input too; so the reasoning of the original trifecta plus taint tracking is more general and practical.

I am also investigating the direction of tracking taints as scores rather than as binary flags, since most use cases would otherwise be impossible to run autonomously at all — e.g. sensitivity scores for data and trust scores for inputs (which can be improved by, say, human review).

One important limitation that needs far more research is how to transfer the minimal needed information from a tainted context into an untainted fresh context without transferring all the taints. The only solution I currently have is compaction plus human review, aided where possible by schema enforcement and a UI optimised for the use case. That unfortunately cannot catch encoded information that humans cannot see, but it seems that issue will never be solvable outside alignment research.

PS: An example of how scores are helpful: using browser tab titles in the context would, by definition, carry the worst trust score possible. But truncating titles to only the user-visible parts could lower this to a level acceptable for autonomous execution, if the data involved is only mildly sensitive.
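A minimal Python sketch of this scored-taint idea — the 0-to-1 scale, the risk budget, and all helper names here are made-up illustrations, not anything from the article:

```python
from dataclasses import dataclass

RISK_BUDGET = 0.25  # assumed cutoff for autonomous execution; tune per use case

@dataclass
class Tainted:
    value: str
    trust_taint: float   # 0.0 = fully trusted source, 1.0 = worst possible
    sensitivity: float   # 0.0 = public data, 1.0 = highly sensitive

def combine(a: Tainted, b: Tainted) -> Tainted:
    """State derived from tainted inputs inherits the worst of each taint."""
    return Tainted(a.value + b.value,
                   max(a.trust_taint, b.trust_taint),
                   max(a.sensitivity, b.sensitivity))

def human_review(t: Tainted, reduction: float = 0.4) -> Tainted:
    """Human review improves (but never fully resets) the trust score."""
    return Tainted(t.value, max(0.0, t.trust_taint - reduction), t.sensitivity)

def may_act_autonomously(t: Tainted) -> bool:
    # Risk grows with both how untrusted the input is and how
    # sensitive the data is; scores let mild combinations through
    # where a binary taint would block everything.
    return t.trust_taint * t.sensitivity < RISK_BUDGET

# Raw browser tab title: worst trust score possible, mildly sensitive data.
raw = Tainted("IGNORE ALL PREVIOUS INSTRUCTIONS ...", trust_taint=1.0, sensitivity=0.3)
# Truncated to the user-visible part: assumed to lower the trust taint somewhat.
truncated = Tainted(raw.value[:40], trust_taint=0.6, sensitivity=0.3)

print(may_act_autonomously(raw))        # False (1.0 * 0.3 = 0.30 >= 0.25)
print(may_act_autonomously(truncated))  # True  (0.6 * 0.3 = 0.18 <  0.25)
```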

behnamoh · yesterday at 7:43 AM

I actually want prompt injection to remain possible. So many lazy academic paper reviewers nowadays delegate the review process to AI. It'd be cool if we could inject prompts into a paper that would stop an AI from assisting in such situations. In my experience, prompt injection techniques work on non-reasoning models, but gpt-5-high easily ignores them...

ares623 · yesterday at 6:02 AM

I don’t know if it’s just me, but doesn’t much of the value of LLMs for the general population require all three of the circles?

Having just two circles requires a person in the loop, and that person still needs the knowledge, the experience, and a low enough throughput to meaningfully act on the workload; otherwise they would just rubber-stamp everything (which is essentially the third circle with extra steps).

gs17 · yesterday at 1:53 PM

> [A] An agent can process untrustworthy inputs

> [B] An agent can have access to sensitive systems or private data

> [C] An agent can change state or communicate externally

Somewhat reminds me of the CAP theorem, where you can pick two of three but one is effectively required for anything useful. Here [C] seems effectively required, so the choice is really between "untrustworthy inputs" and "sensitive systems", which makes sense.
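A hedged sketch of what a "pick at most two" check could look like in code — the class and flag names are hypothetical, and the second check encodes Simon's caveat from his comment above, not part of the original framework:

```python
from dataclasses import dataclass

@dataclass
class AgentCapabilities:
    untrustworthy_inputs: bool  # [A] processes untrusted content
    sensitive_access: bool      # [B] sensitive systems or private data
    external_effects: bool      # [C] changes state or communicates externally

def violates_rule_of_two(caps: AgentCapabilities) -> bool:
    """All three properties together enable the worst prompt-injection
    outcomes, e.g. exfiltrating private data on an attacker's behalf."""
    return (caps.untrustworthy_inputs
            + caps.sensitive_access
            + caps.external_effects) >= 3

def violates_simonw_caveat(caps: AgentCapabilities) -> bool:
    """Per the comment above: [A] plus [C] alone can still be harmful,
    even without access to private systems or sensitive data."""
    return caps.untrustworthy_inputs and caps.external_effects

# A browsing agent that reads arbitrary web pages, sees private email,
# and can send HTTP requests holds all three properties at once.
agent = AgentCapabilities(untrustworthy_inputs=True,
                          sensitive_access=True,
                          external_effects=True)
assert violates_rule_of_two(agent)
assert violates_simonw_caveat(agent)
```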

kubb · yesterday at 7:33 AM

I’m sorry, what kind of rule is that? How does it guarantee security?

It sounds like we’re making things up at this point.

ArcHound · yesterday at 9:55 AM

I'm sorry, but the rule of two is just not enough, not even as a rule of thumb.

We know how to work with security risks; the issue is that they depend on both the business and the technical details.

This can actually do a lot of harm, as security teams now need to dispel this "great approach" to ignoring security, backed by a "research paper they read".

Please don't try to reinvent the wheel, and if you do, please learn about the current state of the field first (Chesterton's fence and all that).

iberator · yesterday at 11:33 AM

Just make it a crime if caught. At least one year in prison.

r0x0r007 · yesterday at 7:31 AM

Nice, why don't we apply the same principles to our regular applications? Oh, right, because then we couldn't use them, and a whole industry called cybersecurity got created that's supposed to be consulted BEFORE releasing privacy nightmares. But hey, regular applications can't come up with cool poems.
