Who are these aggressive scrapers run by?
It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate that amounts to a DDoS-like attack?
If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to scrape 10,000 sites at 10Mb/s than to DDoS a smaller number of sites with more traffic. Of course the big labs want this data, but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
I've been asking this for a while, especially as a lot of the early blame went on the big, visible US companies like OpenAI and Anthropic. While their incentives are different from search engines' (as someone said early on in this onslaught, "a search engine needs your site to stay up; an AI company doesn't"), that's quite a subtle incentive difference. Just avoiding the blocks that inevitably spring up when you misbehave is an incentive the other way -- and probably the biggest reason robots.txt obedience, delays between accesses, back-off algorithms, etc. are widespread. We have a culture that conveys all of these approaches, and reciprocity has its part, but I suspect that blocking is part of the encouragement to adopt them. It could be that they're in too much of a hurry to follow the rules, or it could be others hiding behind those bot names (or other names). Unsure.
Anyway, I think the (currently small [1] but growing) problem is going to be individuals using AI agents to access web pages. I think this falls under the category of traffic that people are concerned about, even though it's under an individual user's control, and those users are ultimately accessing that information (though perhaps without seeing the ads that pay for it). AI agents are frequently zooming off and collecting hundreds of citations for an individual user in the time that a user-agent under the manual control of a human would click on a few links. Even if those links aren't all accessed, that's going to change the pattern of organic browsing for websites.
Another challenge is that, with tools like Claude Cowork, users are increasingly going to be able to create their own one-off crawlers. I've had a couple of occasions where I've ended up crafting a crawler to answer a question, and I've had to intervene and explicitly tell Claude to "be polite" before it would build in time delays and the like (I got temporarily blocked by NASA because I hadn't noticed Claude was hammering a 404 page).
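For what it's worth, "polite" doesn't take much code; the problem is that it isn't the default. Here's a minimal sketch, assuming Python with urllib.robotparser and requests (the user agent, delays and target site are all placeholders, not anything I actually ran): check robots.txt, honor a crawl delay, back off on 429/503 responses, and stop retrying pages that 404.

    # Rough sketch of a "polite" one-off fetcher; everything named here is
    # hypothetical and only illustrates the conventions discussed above.
    import time
    import urllib.robotparser

    import requests

    USER_AGENT = "my-one-off-crawler/0.1 (contact: me@example.org)"  # placeholder
    BASE_DELAY = 5  # seconds between requests; generous on purpose

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.org/robots.txt")  # placeholder target site
    robots.read()
    delay = robots.crawl_delay(USER_AGENT) or BASE_DELAY

    session = requests.Session()
    session.headers["User-Agent"] = USER_AGENT

    def polite_get(url, max_retries=3):
        """Fetch url only if robots.txt allows it; back off on errors, give up on 404."""
        if not robots.can_fetch(USER_AGENT, url):
            return None  # disallowed: skip it rather than work around it
        for attempt in range(max_retries):
            resp = session.get(url, timeout=30)
            if resp.status_code == 404:
                return None  # don't keep hammering a page that isn't there
            if resp.status_code in (429, 503):
                time.sleep(delay * 2 ** attempt)  # exponential back-off
                continue
            time.sleep(delay)  # pause before the caller asks for the next page
            return resp
        return None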
The Web was always designed to be readable by humans and machines, so I don't see a fundamental problem now that end users have more capability to work with machines to learn what they need. But even if we track down and successfully discourage bad actors, we need to work out how to adapt to the changing patterns of how good actors, empowered by better access to computation, can browse the web.
[1] - https://radar.cloudflare.com/ai-insights#ai-bot-crawler-traf...
I don't think that most of them are from big-name companies. I run a personal web site that has been periodically overwhelmed by scrapers, prompting me to update my robots.txt with more disallows.
The only big AI company whose bot I recognized by name was OpenAI (GPTBot). Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.
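For concreteness, the additions look something like this. GPTBot is OpenAI's documented crawler token; the second name is just a placeholder for whatever shows up in your own logs, and Crawl-delay is a non-standard directive that only some crawlers honor (and the shady ones honor nothing at all):

    # Block OpenAI's documented crawler outright
    User-agent: GPTBot
    Disallow: /

    # Placeholder for the no-name scrapers that turn up in the Apache logs
    User-agent: SomeObscureScraperBot
    Disallow: /

    # Everyone else: crawl, but slowly (non-standard, only some bots honor it)
    User-agent: *
    Crawl-delay: 10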
As for why a lot of dumb bots are interested in my web pages now, when they're already available through Common Crawl, I don't know.
I bet some guy just told Claude Code to archive all of LWN for him on a whim.
LWN includes archives of a bunch of mailing lists, so that might be a factor. There are a LOT of web pages on that domain.
I'd guess some sort of middle-management local maximum. Someone set some metric of X pages per day scraped, or Y bits per month - whatever. The CEO gets what he wants.
Then that got passed down to the engineers and those engineers got ridden until they turned the dial to 11. Some VP then gets to go to the quarterly review with a "we beat our data ingestion metrics by 15%!".
So any engineer who pushes back basically gets told "too bad, do it anyway."
As someone who runs the infrastructure for a large OSS project: mostly Chinese AI firms. All the big name-brand AI firms play reasonably nice and respect robots.txt.
The Chinese ones are hyper-aggressive, with no rate limiting and pure greedy scraping. They'll scrape the same content hundreds of times in the same day.
Perhaps incompetence instead of malice - a misconfigured or buggy scraper, etc.
> If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to scrape 10,000 sites at 10Mb/s than to DDoS a smaller number of sites with more traffic
A little over a decade ago (f*ck, I'm old now [0]), I had a similar conversation with an ML researcher at Nvidia. Their response was "even if we are overtraining, it's a good problem to have because we can reduce our false negative rate".
Everyone continues to have an incentive to optimize for true positives and false positives at the expense of false negatives - in this context, grab everything rather than risk missing anything.
NSA, trying to force everybody onto their Cloudflare reservation.
> If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to...
You are incorrectly assuming competence, thoughtful engineering, and/or some modicum of care for negative externalities. The scraper may have been whipped up by AI and shipped an hour later, after a quick 15-minute test against en.wikipedia.org.
Whoever the perpetrator is, they are hiding behind "residential IP providers", so there's no reputational risk. Further, AI companies already have a reputation for engaging in distasteful practices, but popular wisdom claims that they make up for the awfulness with utility, so even if it turns out to be a big org like OpenAI or Anthropic, people will shrug their shoulders and move on.