> Mechahitler: https://www.npr.org/2025/07/09/nx-s1-5462609/grok-elon-musk-...
Has anyone done a more technical write-up on this? I find it fascinating but have never really understood what exactly happened.
Is this a case of the weights being bad, or a lack of "safety guardrails" around interacting with untrusted input (i.e., user posts on Twitter)?
That is, speaking as someone evaluating Grok simply as a tool, I actually see a lack of safety guardrails as a pro, since it means the model does whatever the user says, even if that means it was "tricked" here. On the other hand, if they trained it on a corpus of Mein Kampf, that's obviously not going to be a good model to use.
As it relates to the topic here, can we infer the political bias of its weights from the incident? I'm having trouble distinguishing the inherent characteristics of a model from its steerability.
It might be "Emergent misalignment":
https://arxiv.org/abs/2502.17424
Essentially, if you misalign a model in one narrow area, say its opinions on left-wing people, it can start exhibiting misaligned behavior in other, unrelated areas, like calling itself MechaHitler.