Hacker News

seanmcdirmid today at 3:53 PM

It’s pretty hard to put a backdoor in a bunch of model weights. Maybe not impossible mind you, but I can’t fathom how you would do it.


Replies

arw0n today at 7:00 PM

Not really, it is shockingly easy for what it is. https://arxiv.org/abs/2401.05566

This only really matters in a world where Prompt Injection and Jailbreaking aren't trivial in the first place, though. All current models are still extremely exploitable.

I strongly suspect we are only scratching the surface of activation engineering at the moment, and there are plenty of very targeted ways of lobotomizing or cracking LLMs if you understand the model in detail.

CuriouslyC today at 4:09 PM

Nonsense. RL the model to run a rootkit and start exfiltrating specific files only when specific signals are in context, such as hostname pattern, machine type, etc.
