>Have a sub-agent read the data and extract a structured request for information or list of requested actions. This agent must be treated as an agent of the user that submitted the data.
That just means the attacker has to learn how to escape, which is no different from escaping VMs or jails. You have to assume the agent is compromised, because it has ingested untrusted content, and therefore its output is also untrusted. Which means you’re still giving untrusted content to the “parent” AI. I feel like reading Neal Asher’s sci-fi and dystopian future novels is good preparation for this.
> Which means you’re still giving untrusted content to the “parent” AI
Hence the need for a security boundary where you parse, validate, and filter the data, without using AI, before any of it goes to the "parent".
That this data must be treated as untrusted is exactly the point. You need to treat it the same as you would if the person submitting the data had been given direct API access to submit requests to the "parent" AI.
And that means, for example, that you can't allow through fields you can't sanitise, which in turn means strict length and format restrictions. As Simon points out, trying to validate that a large unconstrained text field doesn't contain a prompt injection attack is not likely to work; you're then basically trying to solve the halting problem, because the attacker can adapt to every failure.
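To make that concrete, here's a minimal sketch of what that kind of non-AI boundary might look like, assuming the sub-agent emits a flat JSON object. The field names, length caps, and formats are invented for illustration; the point is that everything is allowlisted, bounded, and pattern-matched, and anything else is rejected outright.

    import json
    import re

    # Hypothetical allowlist: every field the sub-agent may emit, with a hard
    # length cap and a strict format. Anything not listed here gets rejected.
    ALLOWED_FIELDS = {
        "ticket_id": (16, re.compile(r"[A-Z]{2,4}-\d{1,8}")),
        "action":    (16, re.compile(r"refund|close|escalate")),
        "amount":    (10, re.compile(r"\d{1,7}(\.\d{2})?")),
    }

    def validate(raw: str) -> dict:
        """Parse and validate the sub-agent's output with no AI in the loop.

        Raises ValueError on anything that does not conform; the caller
        should treat a rejection as a probable attack, not as something to
        retry with a more lenient parse.
        """
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        if set(data) != set(ALLOWED_FIELDS):
            raise ValueError("unexpected or missing fields")
        clean = {}
        for name, value in data.items():
            max_len, pattern = ALLOWED_FIELDS[name]
            if not isinstance(value, str) or len(value) > max_len:
                raise ValueError(f"{name}: wrong type or too long")
            if not pattern.fullmatch(value):
                raise ValueError(f"{name}: bad format")
            clean[name] = value
        return clean

Note there's no field here that could carry arbitrary prose into the parent's context; that's the part you can't reliably sanitise.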
So you need the narrowest possible API between the two agents, and one that you treat as if hackers can get direct access to, because odds are they can.
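One way to keep that API narrow (the operation names and limits below are made up for the sake of the sketch) is to make it a closed set of typed operations with no free-text parameters at all, handled as if an attacker were calling it directly:

    from dataclasses import dataclass

    # The entire "API" between the two agents: a closed set of operations
    # with fixed, typed parameters. Deliberately no field where arbitrary
    # text from the untrusted side can ride along into the parent's context.
    @dataclass(frozen=True)
    class LookupOrder:
        order_id: int

    @dataclass(frozen=True)
    class RequestRefund:
        order_id: int
        amount_cents: int

    def handle(request: object) -> str:
        # Treat this exactly like an endpoint an attacker can call directly:
        # default-deny, with hard bounds on every parameter.
        if isinstance(request, LookupOrder):
            return f"lookup order {request.order_id}"
        if isinstance(request, RequestRefund):
            if not 0 < request.amount_cents <= 50_000:
                raise ValueError("refund amount out of range")
            return f"refund {request.amount_cents}c on order {request.order_id}"
        raise ValueError("unknown operation")

If what the sub-agent wants doesn't fit one of those shapes, it simply doesn't get through.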
And, yes, you need to harden the first agent against escapes in the same way. Ideally, put it in a DMZ rather than inside your regular network, for example.