A plausible theory I've seen going around: https://x.com/QiaochuYuan/status/2049307867359162460
I wish the blog said more about why exactly training for a nerdy personality ended up rewarding mentions of goblins. Since it's probably not a deterministic, verifiable reward, at their scale the reward model itself is another LLM. But that just pushes the issue down one layer: why did _that_ model start rewarding mentions of goblins?
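For anyone unfamiliar with the setup, here's a rough sketch (not from the linked post) of the LLM-as-judge pattern being described: a grader model is prompted to score each sampled reply, and that score becomes the reward the policy is trained against. The `call_grader_model` function and the rubric text are hypothetical placeholders; the point is just that the reward signal is itself a model output, quirks included.

```python
import re

# Hypothetical placeholder: in practice this would call whatever grader/reward
# LLM the lab uses; here it just stands in for "another LLM produces the score".
def call_grader_model(prompt: str) -> str:
    return "SCORE: 7"  # stub output so the sketch runs end to end

RUBRIC = (
    "Rate the assistant reply for how well it matches a nerdy, playful persona.\n"
    "Reply with 'SCORE: <1-10>' and nothing else.\n\n"
    "Assistant reply:\n{reply}"
)

def personality_reward(reply: str) -> float:
    """LLM-as-judge reward: the 'reward model' is itself a language model,
    so whatever it happens to like (goblin references, say) leaks into
    the policy being optimized against it."""
    graded = call_grader_model(RUBRIC.format(reply=reply))
    match = re.search(r"SCORE:\s*(\d+)", graded)
    return float(match.group(1)) / 10.0 if match else 0.0

print(personality_reward("Ah yes, the goblins in the weights strike again."))
```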
I love the people thinking "I should ask ChatGPT and copy-paste the response into the (tweet|gh comment)"
It is a stateless text/pixel autocomplete; it has no notion of self. Stop spreading this BS.
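To make the "stateless" part concrete: with a typical chat-completions API, nothing persists server-side between calls; the client resends the whole transcript every time, so any apparent "memory" or "self" is just text in the prompt. A minimal sketch, assuming the OpenAI Python client; the model name and messages are illustrative.

```python
from openai import OpenAI

client = OpenAI()
history = [{"role": "user", "content": "Why do you keep mentioning goblins?"}]

# First call: the model only "knows" what is in `history`.
first = client.chat.completions.create(model="gpt-4o-mini", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Second call: unless the client resends the transcript, the model starts from
# scratch -- there is no hidden self that remembers the previous exchange.
history.append({"role": "user", "content": "You said that last time too."})
second = client.chat.completions.create(model="gpt-4o-mini", messages=history)
print(second.choices[0].message.content)
```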
If you tell an LLM it's a mushroom, you'll get thoughts about how its mycelium could be causing the goblins.
This "theory" is simply role playing and has no grounding in reality.