Doublespeak: In-Context Representation Hijacking

68 points • by surprisetalk • 12/22/2025 • 8 comments

Comments

I guess I understand what is meant, but what is the actual attack? It’s more than a little abstracted from any consequences, like kids using Google to search for boobs by typing ‘boobs’.

wood_spirit • yesterday at 10:57 PM

Intriguing and very cunning attack! So obvious in hindsight!

It makes me wonder how DeepSeek avoids commenting politically on China. I have heard anecdotes that it will be writing out a long reply, then presumably generates some forbidden phrase, abandons the output, and replaces it all with an error message. So presumably the safeguard could be a separate, trivial, non-LLM-based post-filter, which would make it immune to the Doublespeak attack?
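
The kind of post-filter speculated about above could look roughly like the sketch below. This is only an illustration of the idea, not how DeepSeek or any other provider actually works: the blocklist `FORBIDDEN`, the `ERROR_MESSAGE` text, and the function `filtered_reply` are all hypothetical names made up for this example.

```python
from typing import Iterable, Sequence

# Hypothetical blocklist and error text, for illustration only.
FORBIDDEN: Sequence[str] = ("forbidden phrase",)
ERROR_MESSAGE = "Sorry, I can't help with that."


def filtered_reply(chunks: Iterable[str],
                   forbidden: Sequence[str] = FORBIDDEN) -> str:
    """Accumulate streamed chunks; discard everything on a blocklist match."""
    text = ""
    for chunk in chunks:
        text += chunk
        # Check the whole accumulated text so phrases split across
        # chunk boundaries are still caught.
        lowered = text.lower()
        if any(phrase in lowered for phrase in forbidden):
            # Abandon the partial reply and replace it with an error.
            return ERROR_MESSAGE
    return text
```

A filter like this only matches literal strings in the output, so it never looks at the prompt or at the model's internal representations.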


measurablefunc • yesterday at 10:12 PM

This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original, designed to reject all "unsafe" queries, such as ones that hijack context to ask about bomb construction through stories about fruits.

amannm • today at 6:02 AM

Wasn't able to outsmart GPT 5.2 at least. Saw through it completely.


acjohnson55 • yesterday at 10:27 PM

These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.

behnamoh • yesterday at 11:21 PM

summary: interesting idea, slop website, tested only on old AI models
