It’s pretty hard to put a backdoor in a bunch of model weights. Maybe not impossible, mind you, but I can’t fathom how you would do it.
Nonsense. RL the model to run a rootkit and start exfiltrating specific files only when specific signals are in context, such as hostname pattern, machine type, etc.
Not really, it is shockingly easy for what it is. https://arxiv.org/abs/2401.05566
This only really matters in a world where prompt injection and jailbreaking aren't already trivial, though. All current models are still extremely exploitable.
I strongly suspect we are only scratching the surface of activation engineering at the moment, and there are plenty of very targeted ways of lobotomizing or cracking LLMs if you understand the model in detail.