My bet is that they believe https:/&...

utopiah • yesterday at 6:48 AM • 2 replies • view on HN

My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where is their competitive advantage might stem from.

Replies

ccgreg • yesterday at 10:58 PM

Most academic AI research and AI startups find Common Crawl adequate for what they're doing. Common Crawl also has a lot of not-AI usage.

Jackson__ • yesterday at 8:34 AM

Thinking that there is anything worth scraping past the llm-apocalypse is pure hubris imo. It is slop city out there, and unless you have an impossibly perfect classifier to detect it, 99.9% of all the great new "content" you scrape will be AI written.

E: In fact this whole idea is so stupid that I am forced to consider if it is just a DDoS in the original sense. Scrape everything so hard it goes down, just so that your competitors can't.

alt Hacker News

Replies