Hacker News

giantrobot, yesterday at 10:05 PM

In the most charitable case, it's some "AI" companies with an XY problem. They want training data, so they vibe-code some naive scraper (requests is all you need!) and never think to ask whether there's some sort of common repository of web crawls, a CommonCrawl if you will.

They don't really need to scrape at all, since CommonCrawl or other content archives would be fine for training. They just don't think or know to ask for what they really want: training data.
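
For what it's worth, here's a rough sketch of what asking the archive instead of the origin site could look like. It assumes Common Crawl's public CDX index at index.commoncrawl.org and its data host at data.commoncrawl.org, and that index records carry `filename`, `offset`, and `length` fields. The crawl ID is only an example placeholder and the function names are made up for illustration, so check the Common Crawl docs before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumption: an example crawl ID; pick a current one from the Common Crawl site.
CRAWL_ID = "CC-MAIN-2024-10"


def index_query_url(url_pattern: str, crawl_id: str = CRAWL_ID) -> str:
    """Build an index query URL for captures matching url_pattern (hypothetical helper)."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{query}"


def warc_range_request(index_line: str) -> tuple[str, dict[str, str]]:
    """Turn one JSON line from the index into a (url, headers) pair.

    The Range header asks for only that record's bytes in the archive file,
    so nothing is requested from the original site.
    """
    record = json.loads(index_line)
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # HTTP byte ranges are inclusive
    url = f"https://data.commoncrawl.org/{record['filename']}"
    return url, {"Range": f"bytes={start}-{end}"}
```

You'd fetch the first URL, feed each returned JSON line to `warc_range_request`, and then issue the ranged GET with whatever HTTP client you like. The point is that every request goes to the archive, not to somebody's small web server.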

In the least charitable interpretation, it's anti-social assholes with no concept of, or care about, negative externalities who write awful naive scrapers.