I don't think that most of them are from big-name companies. I run a personal web site that has been periodically overwhelmed by scrapers, prompting me to update my robots.txt with more disallows.
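For anyone curious, the kind of disallow entries I mean look like this — a minimal robots.txt sketch that blocks OpenAI's documented GPTBot token entirely while leaving other crawlers alone:

```
# Block OpenAI's crawler (token documented by OpenAI)
User-agent: GPTBot
Disallow: /

# Everyone else: no restrictions
User-agent: *
Disallow:
```

Of course this only works for bots that bother to read robots.txt in the first place.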
The only big-name AI crawler I recognized was OpenAI's GPTBot. Most of the rest come from small companies I'm hearing of for the first time when I look up their user agents in the Apache logs. And the shadiest operators probably aren't identifying their requests with a unique user agent at all.
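Pulling those user agents out of the logs is a one-liner, since Apache's default "combined" log format puts the user agent in the sixth double-quoted field. A sketch, using a few hypothetical sample log lines stand­ing in for a real access log:

```shell
# Hypothetical sample entries in Apache "combined" log format.
cat > /tmp/sample_access.log <<'EOF'
1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "GPTBot/1.0"
5.6.7.8 - - [01/Jan/2025:00:00:02 +0000] "GET /a HTTP/1.1" 200 512 "-" "SomeScraper/0.1"
1.2.3.4 - - [01/Jan/2025:00:00:03 +0000] "GET /b HTTP/1.1" 200 512 "-" "GPTBot/1.0"
EOF

# Split on double quotes: field 6 is the user agent.
# Count requests per user agent, most frequent first.
awk -F'"' '{print $6}' /tmp/sample_access.log | sort | uniq -c | sort -rn
```

On the sample above this lists GPTBot/1.0 first with two hits; run against a real access.log it makes the previously unheard-of scrapers jump right out.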
As for why so many dumb bots are suddenly interested in my pages, when the content is already available through Common Crawl, I don't know.
Maybe someone is publishing public "scraper lists" of potentially useful targets that small companies, or even individuals, feed into some common scraping tool. That could explain it, but I'm just as mystified as you are.