logoalt Hacker News

oli5679today at 5:57 PM2 repliesview on HN

Would llms be more robust to this prompt injection if the tags used in fine tuning are sanitised from user input?

E.g. map <think> -> THINK <user> -> USER <tool> -> TOOL

If they learn something specific in the chat finetuning stage, this might show LLM its user input text not these tag references.


Replies

TheSoftwareGuytoday at 6:25 PM

If you read the whole thing, the answer is plainly no:

> It's worth pausing on what this means. LLMs identify roles from an insecure feature (style). This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID.

The LLM is deducing the role of the text from not just the tags, but the style of writing

mrobtoday at 6:08 PM

You can filter out any tokens you like, but the point of the paper is that it's not sufficient, because LLMs often ignore the special label tokens and treat user-injected text as chain-of-thought text merely because it looks like chain-of-thought text, even if it's not labelled as such.