
krethh today at 3:19 AM

> define communication protocols between them that fail when prompt injections are present

That's the "draw the rest of the owl" part of this problem.

Until we figure out a robust theoretical framework for identifying prompt injections (we're not anywhere close to that, to my knowledge; as OP pointed out, all models are getting jailbroken all the time), human-in-the-loop will remain the only defense.


Replies

CuriouslyC today at 3:58 AM

Human-in-the-loop isn't the only defense. You can't achieve complete injection coverage, but you can have an agent convert untrusted input into a response schema with a canary field, then fail any agent output that doesn't conform to the schema or doesn't carry the correct canary value. This works because prompt injection scrambles instruction following: the odds that the injection succeeds, that the isolated agent re-injects it into its output, and that the model still conforms to the original instructions regarding the schema and canary are extremely low. As long as the agent parsing untrusted content doesn't have any shell or other exfiltration tools, this works well.
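A minimal sketch of that check in Python, assuming a hypothetical call_model helper that sends a prompt to the isolated, tool-less model and returns its raw text:

    import json
    import secrets

    def extract_untrusted(content: str, call_model) -> dict | None:
        # call_model is a stand-in for however you invoke the isolated
        # model: it takes a prompt string and returns raw model text.
        # That model must have no shell, network, or other tools.
        canary = secrets.token_hex(8)  # fresh, unguessable per request
        prompt = (
            "Summarize the DOCUMENT as JSON with exactly these keys:\n"
            f'{{"summary": "<string>", "canary": "{canary}"}}\n'
            "Output only the JSON object.\n\n"
            f"DOCUMENT:\n{content}"
        )
        raw = call_model(prompt)
        try:
            out = json.loads(raw)
        except ValueError:  # covers json.JSONDecodeError
            return None     # non-JSON output: fail closed
        # Reject on any schema or canary mismatch.
        if set(out) != {"summary", "canary"} or out["canary"] != canary:
            return None
        if not isinstance(out["summary"], str):
            return None
        return out

The canary is generated fresh per request and never appears in the untrusted content, so an injected instruction can't know what value to forge; any scrambled or partial compliance simply drops the output.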
