wj • today at 4:06 AM • 1 reply

My thought was that messages need to be untrusted by default, and trusted input should be wrapped in a delimiter (with a UUID generated by the UX or API). In this untrusted-by-default mode, only the trusted prompts would be allowed to ask for tool and file system access.

Wrote a bit more here but that is the gist: https://zero2data.substack.com/p/trusted-prompts
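To make the idea concrete, here is a minimal sketch of the wrapping step in Python. The tag format, the `Segment` type, and the function names are all made up for illustration and are not taken from the linked post or any real API. It only shows how a caller might mark trusted input with a per-request UUID; it does not show that a model will actually respect that boundary.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One piece of prompt text plus whether the caller trusts it (hypothetical type)."""
    text: str
    trusted: bool


def new_boundary() -> str:
    # Fresh random token per request, generated by the UX or API layer,
    # so untrusted content cannot know it in advance.
    return uuid.uuid4().hex


def wrap_trusted(text: str, boundary: str) -> str:
    # Hypothetical tag format: the boundary is embedded in the tag name.
    return f"<trusted-{boundary}>{text}</trusted-{boundary}>"


def build_prompt(segments: list[Segment], boundary: str) -> str:
    parts = []
    for seg in segments:
        if seg.trusted:
            parts.append(wrap_trusted(seg.text, boundary))
        else:
            # Untrusted by default: left unwrapped, and any copy of the
            # boundary token is stripped so it cannot forge a trusted tag.
            parts.append(seg.text.replace(boundary, ""))
    return "\n".join(parts)


def may_request_tools(seg: Segment) -> bool:
    # Intended policy: only trusted segments may ask for tool or file system access.
    return seg.trusted
```

A caller would generate `boundary = new_boundary()` once per request and pass it to `build_prompt` along with the mix of trusted and untrusted segments. Note that the result is still a single string handed to the model, so the boundary is a convention the model is asked to follow rather than something the code enforces.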

Replies

simonw • today at 4:56 AM

Sadly this has been tried before and doesn't work.

If an attacker can send enough tokens, they can find a combination of tokens that confuses the LLM into forgetting what the boundary was meant to be, or that overrides it with a new boundary.
