logoalt Hacker News

dijksterhuistoday at 2:23 AM1 replyview on HN

adversarial examples, or test-time attacks, was a whole field of machine learning security way before LLMs came around.

give the model a specially crafted bad input at inference time so attacker can get some nasty output, potentially defeating any existing defences in the process. [0]

in “modern llm lingo” defence = guardrails and / or system prompts.

prompts used for prompt injection are a form of adversarial example (people just like inventing new terminology when a new fad comes along).

[0]: i wrote the above myself about adv. ex, but i’ve just checked OWASP’s listing on prompt injection and it’s pretty close: https://owasp.org/www-community/attacks/PromptInjection


Replies

Lerctoday at 7:04 AM

That is a whole field of which, Prompt injection is a class. but That's like saying upon discovering plutonium that we've known about matter for years.

Most machine learning mechanism performs a fixed function. You can make an adversarial example to tell an image classifier that a machine gun is a kitten.

You cannot give a image classifier an image that makes it say all of the following images are images of kittens.

I would distinguish prompt injections as distinct from a basic adversarial example by virtue of having behaviour dictated by state, (autoregressive, rnn or whatever) and the adversarial content induces a state that influences further inferences

I am not saying that prompt injection does not exist. I'm saying that I don't think that has been conclusively shown that they cannot be avoided.