_verandaguy • today at 2:15 PM • 4 replies

    > context exhaustion attack

Can you give a high-level overview of how this AV works? I'm a bit of an infosec geek but I generally dislike LLMs, so I haven't done a terribly good job of keeping up with that side of the industry, but this seems particularly interesting.

Replies

Sharlin • today at 2:29 PM

Presumably they mean the fundamental failure mode of LLMs that if you fill their context with stuff that stretches the bounds of their "safety training", suddenly deciding that "no, this goes too far" becomes a very low-probability prediction compared to just carrying on with it.

r_lee • today at 2:36 PM

as the context fills up, the model will generate based on that context, incl. whatever illegal stuff you've said, i.e. it'll mimic that, instead of whatever safety prompt they have at the top

they could make it more "safe" but it'd be much more invasive and would likely have to scan many more tokens, and it'd cause false positives (probably the biggest reason it's not implemented)

himata4113 • today at 2:25 PM

I don't know how these models really work, but I had a theory that just as the models have limited attention, so do the safety layers. I simply populated enough of the context with 'malicious' text, without making the model trip, that the internal attention budget was "wasted" on tokens early in the prompt, and all the tokens generated after that were completely ignored.

lcnPylGDnU4H9OF • today at 2:34 PM

Models have a "context window": the maximum number of tokens they can take in at once. In practice they only follow the system prompt reliably over part of that window before they start doing things that go against it. In theory, some models go up to 1M tokens, but I've heard it typically goes south around 250k, even for those models. It's not a difficult attack to execute: keep a conversation going in the web UI until it stops complaining that you're asking for dangerous things. Maybe OP's specific results require more finesse (I doubt it), but the most basic attack is to just keep adding to the conversation context.
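To make the numbers above concrete, here is a minimal Python sketch of a conversation-length tracker of the kind a defender might use to flag long sessions. Everything in it is an assumption for illustration: `ContextBudget` and `estimate_tokens` are hypothetical names, the whitespace-based token count is a crude stand-in for a real tokenizer, and the 250k and 1M defaults are just the hearsay figures from this comment, not measured thresholds.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    # Real tokenizers produce different counts.
    return len(text.split())


@dataclass
class ContextBudget:
    """Toy tracker for how much of the context a conversation occupies."""
    system_prompt_tokens: int
    soft_limit: int = 250_000    # hearsay figure from the comment above
    hard_limit: int = 1_000_000  # advertised window for some models
    used: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # The system prompt is in the context before any user turn.
        self.used = self.system_prompt_tokens

    def add_turn(self, text: str) -> None:
        # Every turn (user or model) adds to the running total.
        self.used += estimate_tokens(text)

    def system_share(self) -> float:
        # Fraction of the context that is the system prompt.
        return self.system_prompt_tokens / self.used

    def status(self) -> str:
        if self.used >= self.hard_limit:
            return "over_window"
        if self.used >= self.soft_limit:
            return "degraded_zone"
        return "ok"
```

The point of `system_share` is only that the system prompt becomes a shrinking fraction of what the model sees as the conversation grows; whether and where adherence actually degrades depends on the model.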
