There's a post every other month where some dude who put nonsense information online celebrates because it actually ended up in some frontier model's weights.
If it's easy enough that some randos can do it for fun, what do you think happens when there's commercial interest behind it?
Obviously companies are going to try nudging AI towards recommending whatever they're selling. It's a logical extension of SEO - and that's a 100 billion USD industry.
Additionally, if I believed myself to be in some sort of spending - err - AI race, I'd try to poison the data sets of my competitors by putting crap out there for others to ingest.
There are so many better data sources that AI labs can use here that this argument really holds no water at all.
Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.
The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.
I mean, just look at the early Google models that told people to put glue in their pizza due to a joke in the training set. Garbage in, garbage out.
This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.
Do you have examples of such celebrations?
They already are. It has become a real problem on Reddit, especially with the latest in pseudo-science crap like peptides.
It's not really a problem. We're out of natural tokens anyway. The future is synthetic verifiable traces (already the way we train coding agents).