Hacker News

Doublespeak: In-Context Representation Hijacking

68 points by surprisetalk | 12/22/2025 | 8 comments

Comments

hyperhello today at 10:41 AM

I guess I understand what is meant, but what is the actual attack? It’s more than a little abstracted from any consequences, like kids using Google to search for boobs by typing ‘boobs’.

wood_spirit yesterday at 10:57 PM

Intriguing and very cunning attack! So obvious in hindsight!

It makes me wonder how DeepSeek avoids commenting politically on China. I have heard anecdotes that it will be writing out a long reply, then presumably generates some forbidden phrase, abandons the output, and replaces it all with an error message. So presumably the safeguard could be a separate, trivial, non-LLM post-filtering step, which would make it immune to the doublespeak attack?
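
To make the speculated mechanism concrete, here is a minimal sketch of a non-LLM post-filter that watches streamed output and swaps the whole reply for an error once a blocked phrase shows up. This is only an illustration of the idea described above, not how any real product is known to work; `BLOCKED_PHRASES`, `ERROR_MESSAGE`, and `filter_stream` are hypothetical names invented for the example.

```python
from typing import Iterable, Sequence

# Hypothetical blocklist and error text, purely for illustration.
BLOCKED_PHRASES: Sequence[str] = ("forbidden phrase",)
ERROR_MESSAGE = "Sorry, I can't help with that."


def filter_stream(chunks: Iterable[str],
                  blocked: Sequence[str] = BLOCKED_PHRASES) -> str:
    """Accumulate streamed text and return it unchanged, unless a blocked
    phrase appears, in which case everything is discarded and the error
    message is returned instead."""
    shown = ""
    for chunk in chunks:
        shown += chunk
        # Check the full accumulated text so phrases split across
        # chunk boundaries are still caught.
        lowered = shown.lower()
        if any(phrase in lowered for phrase in blocked):
            return ERROR_MESSAGE
    return shown
```

A filter like this only sees surface text, so whether it would help against a substitution-style attack depends on whether the blocked term ever appears verbatim in the output.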

measurablefunc yesterday at 10:12 PM

This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original, designed to reject all "unsafe" queries, such as ones that hijack the context to disguise bomb construction as stories about fruit.

amannm today at 6:02 AM

Wasn't able to outsmart GPT 5.2 at least. Saw through it completely.

acjohnson55 yesterday at 10:27 PM

These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.

behnamoh yesterday at 11:21 PM

summary: interesting idea, slop website, tested only on old AI models