Hacker News

Ifkaluva, yesterday at 7:50 PM

The big labs spend a ton of effort on dataset curation, precisely to prevent their models from ingesting "poison," as you put it.

It goes further than that: they run lots of tests on the dataset to find the incremental data that produces the best improvements in model performance, and they even train proxy models that predict whether a given piece of data will improve performance.

“Data Quality” is usually a huge division with a big budget.
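The proxy-model idea described above can be sketched roughly like this. This is a hypothetical toy, not any lab's actual pipeline: real systems train a learned scorer on labeled examples, while here two hand-coded features (character cleanliness and lexical diversity) stand in for the learned model, and the `quality_score`, `filter_dataset`, and threshold names are all made up for illustration.

```python
# Toy sketch of a cheap "proxy" quality scorer used to filter candidate
# training documents before they reach an expensive training run.
# (Hypothetical: real pipelines use a trained model, not hand-coded features.)

def quality_score(doc: str) -> float:
    """Score in [0, 1]: higher means more likely to help training."""
    if not doc:
        return 0.0
    words = doc.split()
    # Feature 1: fraction of alphabetic/whitespace characters
    # (penalizes markup debris and symbol spam).
    alpha = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    # Feature 2: lexical diversity (penalizes boilerplate repetition).
    diversity = len(set(words)) / len(words) if words else 0.0
    return 0.5 * alpha + 0.5 * diversity

def filter_dataset(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep only documents the proxy scorer rates above the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

clean = "The labs curate data carefully before training models."
noisy = "click here click here click here <div><div><div> $$$"
kept = filter_dataset([clean, noisy])  # only the clean document survives
```

The point of the proxy is economics: scoring a document this way costs microseconds, so you can afford to run it over the whole candidate corpus, whereas measuring a document's true incremental value would require a training run.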


Replies

conartist6, yesterday at 8:54 PM

Jeez, why can't I have a data quality team filtering out AI slop!
