logoalt Hacker News

mika-eltoday at 11:00 AM1 replyview on HN

The irony here is that the detection method is literally prompt injection — the same technique that's a security vulnerability everywhere else. ICML embedded hidden instructions in PDFs that manipulate LLM output. In a different context that's an attack, here it's enforcement.

From my perspective this says something important about where we are with LLMs. The fact that you can reliably manipulate model output by hiding instructions in the input means the model has no real separation between data and commands. That's the fundamental problem whether you're catching lazy reviewers or defending against actual attacks.


Replies

nulltracetoday at 11:45 AM

[dead]