Huh? Once it gets to the model, it's all just tokens, and those are just in band signalling. A ...

lambda • yesterday at 9:25 PM • 1 reply • view on HN

Huh? Once it gets to the model, it's all just tokens, and those are just in band signalling. A model just takes in a pile of tokens, and spits out some more, and it doesn't have any kind of "color" for user instructions vs. untrusted data. It does use special tokens to distinguish system instructions from user instructions, but all of the untrusted data also goes into the user instructions, and even if there are delimiters, the attention mechanism can get confused and it can lose track of who is talking at a given time.

And the thing is, even adding a "color" to tokens wouldn't really work, because LLMs are very good at learning patterns of language; for instance, even though people don't usually write with Unicode enclosed alphanumerics, the LLM learns the association and can interpret them as English text as well.

As I say, prompt injection is a very real problem, and Anthopic's own system card says that on some tests the best they do is 50% on preventing attacks.

If you have a more reliable way of fixing prompt injection, you could get paid big bucks by them to implement it.

Replies

charcircuit • yesterday at 10:45 PM

>Once it gets to the model, it's all just tokens

The same thing could be said about the internet. When it comes down to the wire it's all 0s and 1s.

➕ show 1 reply

alt Hacker News

Replies