I've only read the blog post and not the paper, so maybe they go into more detail there and someone can correct me, but they frequently bring up the model's ability to detect, or at least activations hinting that it can predict, when it's being tested. I can't help but wonder: as they build these larger and larger models, where could they be getting "clean" training data, untainted by all these types of blog posts and the massive numbers of conversations they spawn? If the models ingest data like that, wouldn't it make sense that they'd end up with more activations attuned to questions that appear adversarial?