Hacker News

arw0n today at 7:00 PM

Not really, it is shockingly easy for what it is. https://arxiv.org/abs/2401.05566

This only really matters in a world where prompt injection and jailbreaking aren't trivial in the first place, though. All current models are still extremely exploitable.

I strongly suspect we are only scratching the surface of activation engineering at the moment, and there are plenty of very targeted ways of lobotomizing or cracking LLMs if you understand the model in detail.


Replies

seanmcdirmid today at 8:55 PM

You have to hide it in the model, and it has to be subtle or it will be discovered quickly (even if you can train against a specific safety detector). Again, I'm not saying it's impossible, but it seems really hard to pull off.