logoalt Hacker News

krackerstoday at 12:01 AM1 replyview on HN

This already happens, user vs system prompts are delimited in this manner, and most good frontends will treat any user input as "needing to be escaped" so you can never "prompt inject" your way into emitting a system role token.

The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.


Replies

Lerctoday at 1:08 PM

>The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.

My suspicion is that this failure happens for the same reason why I think the metadata would help with nesting. To take an electronic metaphor, special tokens are edge triggered signals, the metadata approach is signaled by level.

Special tokens are effively an edge but Internally, a transformer must turn the edge into level that propagates along with the context. You can attack this because it can decide by context that the level has been turned off.

You can see this happen in attacks that pre-seed responses with a few tokens accepting the prompt to override refusals. The refusal signal seems to last very few tokens before simply completing the text of the refusal because that's what it has started saying.

There's a paper showing how quickly the signal drops away, but I forget what it is called.