There's a post every other month where some dude who put nonsense information online celebrates...

chmod775 • yesterday at 7:38 PM • 4 replies

There's a post every other month where some dude who put nonsense information online celebrates because it actually ended up in some frontier model's weights.

If it's easy enough that some randos can do it for fun, what do you think happens when there's commercial interest behind it?

Obviously companies are going to try nudging AI towards recommending whatever they're selling. It's a logical extension of SEO - and that's a 100 billion USD industry.

Additionally, if I believed myself to be in some sort of spending - err - AI race, I'd try to poison the data sets of my competitors by putting crap out there for others to ingest.

Replies

aspenmartin • yesterday at 8:20 PM

It's not really a problem. We're out of natural tokens anyway. The future is synthetic verifiable traces (already the way we train coding agents).

brokencode • yesterday at 10:13 PM

There are so many better data sources that AI labs can use here that this argument really holds no water at all.

Peer reviewed journals, textbooks, in-house teams of experts, trusted news publications, etc.

The whole idea of scraping large swaths of the internet for training data has always been pretty dubious due to the variable data quality.

I mean, just look at the early Google models that told people to put glue in their pizza due to a joke in the training set. Garbage in, garbage out.

This is one of the first and most obvious problems all of these labs have run into, and countermeasures are only going to improve.

jurgenaut23 • yesterday at 8:24 PM

Do you have examples of such celebrations?

Shitty-kitty • yesterday at 9:26 PM

They already are. It has become a real problem on Reddit, especially with the latest in pseudo-science crap like peptides.
