Hacker News

StephenHerlihyy yesterday at 10:39 PM

I don’t know why anyone would still be trying to pull data off the open internet. The signal-to-noise ratio is too low. So much AI influence is already baked into the corpus. You are just going to be reinforcing existing bias. I’m more worried about the day Amazon or Hugging Face take down their large datasets.


Replies

saaaaaam yesterday at 10:55 PM

MetaBrainz is a fairly valuable “high signal” dataset though.