logoalt Hacker News

lambdatoday at 4:30 AM0 repliesview on HN

A piece of software that you write, in code, unless you use random numbers or multiple threads without synchronization, will operate in a deterministic way. You know that for a given input, you'll get a given output; and you can reason about what happens when you change a bit, or byte, or token in the input. So you can be sure, if you implement a parser correctly, that it will correctly distinguish between one field that comes from a trusted source, and another that comes from an untrusted source.

The same is not true of an LLM. You cannot predict, precisely, how they are going to work. They can behave unexpectedly in the face of specially crafted input. If you give an LLM two pieces of text, delimited with a marker indicating that one piece is trusted and the other is untrusted, even if that marker is a special token that can't be expressed in band, you can't be sure that it's not going to act on instructions in the untrusted section.

This is why even the leading providers have trouble with protecting against prompt injection; when they have instructions in multiple places in their context, it can be hard to make sure they follow the right instructions and not the wrong ones, since the models have been trained so heavily to follow instructions.