This is the "confused deputy problem". [0]
And capabilities [1] is the long-known, and sadly rarely implemented, solution.
Using the trifecta framing, we can't take away the untrusted user input. The system then should not have both the "private data" and "public communication" capabilities.
The thing is, if you want a secure system, the idea that system can have those capabilities but still be restricted by some kind of smart intent filtering, where "only the reasonable requests get through", must be thrown out entirely.
This is a political problem. Because that kind of filtering, were it possible, would be convenient and desirable. Therefore, there will always be a market for it, and a market for those who, by corruption or ignorance, will say they can make it safe.
You're a machine Simon, thank you for all of the effort. I have learned so much just from your comments and your blog.
Im still fixing sql and db command injection through APIs from juniors and now vibe coders. This just adds more work to do.
The ITT/TTI and TTS/STT have been particularly annoying to protect against. I don’t feel we’ve matured enough to have solid protections against such vectors yet.
How does Perplexity Comet and Dia not suffer from data leakage like this? They seem to completely violate the lethal trifecta principle and intermix your entire browser history, scraped web page data and LLM’s.
It must be so much extra work to do the presentation write-up, but it is much appreciated. Gives the talk a durability that a video link does not.
Maybe this will finally get people over the hump and adopt OSs based on capability based security. Being required to give a program a whitelist at runtime is almost foolproof, for current classes of fools.
I am against agents. (I will happy to be proved wrong, I want agents, especially agents that could drive my car, but that is another disappointment....)
There is a paradox in the LLM version of AI, I believe.
Firstly it is very significant. I call this a "steam engine" moment. Nothing will ever be the same. Talking in natural language to a computer, and having it answer in natural language is astounding
But! The "killer app" in my experience is the chat interface. So much is possible from there that is so powerful. (For people working with video and audio there are similar interfaces that I am less familiar with). Hallucinations are part of the "magic".
It is not possible to capture the value that LLMs add. The immense valuations of outfits like OpenAI are going to be very hard to justify - the technology will more than add the value, but there is no way to capture it to an organisation.
This "trifecta" is one reason. What use is an agent if it has no access or agency over my personal data? What use is autonomous driving if it could never go wrong and crash the car? It would not drive most of the places I need it to.
There is another more basic reason: The LLMs are unreliable. Carefully craft a prompt on Tuesday, and get a result. Resubmit the exact same prompt on Thursday and there is a different result. It is extortionately difficult to do much useful with that, for it means that every response needs to be evaluated. Each interaction with an LLM is a debate. That is not useful for building an agent. (Or an autonomous vehicle)
There will be niches where value can be extracted (interactions with robots are promising, web search has been revolutionised - made useful again) but trillions of dollars are being invested, in concentrated pools. The returns and benefits are going to be disbursed widely, and there is no reason they will accrue to the originators. (Nvidea tho, what a windfall!)
In the near future (a decade or so) this is going to cause an enormous economic dislocation and rearrangement. So much money poured into abstract mathematical calculations - good grief!
If you were wondering about the pelicans: https://baynature.org/article/ask-naturalist-many-birds-beac...
This is a fantastic way of framing it, in terms of simple fundamental principles.
The problem with most presentations of injection attacks is it only inspires people to start thinking of broken workarounds - all the things mentioned in the article. And they really believe they can do it. Instead, as put here, we have to start from a strong assumption that we can't fix a breakage of the lethal trifecta rule. Rather, if you want to break it, you have to analyse, mitigate and then accept the irreducible risk you just incurred.
The lethal trifecta is a problem problem (a big problem) but not the only one. You need to break a leg of all the lethal stools of AI tool use.
For example a system that only reads github issues and runs commands can be tricked into modifying your codebase without direct exfiltration. You could argue that any persistent IO not shown to a human is exfiltration though...
OK then you can sudo rm -rf /. Less useful for the attacker but an attack nonetheless.
However I like the post its good to have common terminology when talking about these things and mental models for people designing these kinds of systems. I think the issue with MCP is that the end user who may not be across these issues could be clicking away adding MCP servers and not know the issues with doing so.
One idea I've had floating about in my head is to see if we can control-vector our way out of this. If we can identify an "instruction following" vector and specifically suppress it while we're feeding in untrusted data, then the LLM might be aware of the information but not act on it directly. Knowing when to switch the suppression on and off would be the job of a pre-processor which just parses out appropriate quote marks. Or, more robustly, you could use prepared statements, with placeholders to switch mode without relying on a parser. Big if: if that works, it undercuts a different leg of the trifecta, because while the AI is still exposed to untrusted data, it's no longer going to act on it in an untrustworthy way.
The link to the article covering Google Deepmind's CaMeL doesn't work.
Presumably intended to go to https://simonwillison.net/2025/Apr/11/camel/ though
All of my MCPs, including browser automation, are very much deterministic. My backend provides a very limited amount of options. Say for doing my Amazon shopping, it is fed the top 10 options per search query, and can only put one in a cart. Then email me when its done for review, it can't actually control the browser fully.
Essentially I provide a very limited (but powerful) interactive menu for every MCP response, it can only respond with the Index of the menu choice, one number, it works really well at preventing scary things (which I've experienced) search queries with some parsing, but must fit in a given sites url pattern, also containerization ofc.
This is way more common with popular MCP server/agent toolsets than you would think.
For those interested in some threat modeling exercise, we recently added a feature to mcp-scan that can analyze toolsets for potential lethal trifecta scenarios. See [1] and [2].
[1] toxic flow analysis, https://invariantlabs.ai/blog/toxic-flow-analysis
[2] mcp-scan, https://github.com/invariantlabs-ai/mcp-scan
Interesting presentation, but the name is too generic to catch on.
> the lethal trifecta is about stealing your data. If your LLM system can perform tool calls that cause damage without leaking data, you have a whole other set of problems to worry about.
“LLM exfiltration trifecta” is more precise.
Great work! Great name!
I'm currently doing a Month of AI bugs series and there are already many lethal trifecta findings, and there will be more in the coming days - but also some full remote code execution ones in AI-powered IDEs.
I have been skeptical from day one of using any Gen AI tool to produce output for systems meant for external use. I’ll use it to better understand input and then route to standard functions with the same security I would do for a backend for a website and have the function send deterministic output.
This dude named a Python data analysis library after a retrocomputing (Commodore era) tape drive. He _definitely_ should stop trying to name things.
There is a single reason why this is happening and it is due to a flawed standard called “MCP”.
It has thrown away almost all the best security practices in software engineering and even does away with security 101 first principles to never trust user input by default.
It is the equivalent of reverting back to 1970 level of security and effectively repeating the exact mistakes but far worse.
Can’t wait for stories of exposed servers and databases with MCP servers waiting to be breached via prompt injection and data exfiltration.
It seems like the answer is basically taint checking, which has been known about for a long time (TTBOMK it was in the original Perl 5, and maybe before).
This is a very minor annoyance of mine, but is anyone else mildly annoyed at the increasing saturation of cool, interesting blog and post titles turn out to be software commentary?
Nothing against the posts themselves, but it's sometimes a bit absurd, like I'll click "the raging river, a metaphor for extra dimensional exploration", and get a guide for Claude Code. No it's usually a fine guide, but not quite the "awesome science fact or philosophical discussion of the day" I may have been expecting.
Although I have to admit it's clearly a great algorithm/attention hack, and it has precedent, much like those online ads for mobile games with titles and descriptions that have absolutely no resemblance to the actual game.
"One of my weirder hobbies is helping coin or boost new terminology..." That is so fetch!
Simon is a modern day Brooksley Born, and like her he's pushing back against forces much stronger than him.
The key thing, it seems to me, is that as a starting point, if an LLM is allowed to read a field that is under even partial control by entity X, then the agent calling the LLM must be assumed unless you can prove otherwise to be under control of entity X, and so the agents privileges must be restricted to the intersection of their current privileges and the privileges of entity X.
So if you read a support ticket by an anonymous user, you can't in this context allow actions you wouldn't allow an anonymous user to take. If you read an e-mail by person X, and another email by person Y, you can't let the agent take actions that you wouldn't allow both X and Y to take.
If you then want to avoid being tied down that much, you need to isolate, delegate, and filter:
- Have a sub-agent read the data and extract a structured request for information or list of requested actions. This agent must be treated as an agent of the user that submitted the data.
- Have a filter, that does not use AI, that filters the request and applies security policies that rejects all requests that the sending side are not authorised to make. No data that can is sufficient to contain instructions can be allowed to pass through this without being rendered inert, e.g. by being encrypted or similar, so the reading side is limited to moving the data around, not interpret it. It needs to be strictly structured. E.g. the sender might request a list of information; the filter needs to validate that against access control rules for the sender.
- Have the main agent operate on those instructions alone.
All interaction with the outside world needs to be done by the agent acting on behalf of the sender/untrusted user, only on data that has passed through that middle layer.
This is really back to the original concept of agents acting on behalf of both (or multiple) sides of an interaction, and negotiating.
But what we need to accept is that this negotiation can't involve the exchange arbitrary natural language.