I would say there are a couple of aspects.
The crawlers for the big famous names in AI are all less well behaved and more voracious than, say, Googlebot. This is somewhat muddied by the fact that the companies that ran the former "good" crawlers are now all in the AI business too, and sometimes tried to piggyback on people having allowed or whitelisted their search-crawling User-Agent. That has mostly settled down now that they separate Googlebot from GoogleOther, facebookexternalhit from meta-externalagent, etc. This was an earlier "wave" of increased crawling that was obviously attributable to AI development. In some cases it's still problematic, but it's generally more manageable.
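For what it's worth, once the tokens are separated you can target them independently. A minimal robots.txt sketch, assuming the crawlers actually honor it (the ones that admit who they are generally do):

```text
# Allow classic search crawling, opt out of the AI-adjacent agents.
User-agent: Googlebot
Allow: /

User-agent: GoogleOther
Disallow: /

User-agent: meta-externalagent
Disallow: /
```

Of course this only helps against the well-behaved crawlers; it does nothing for the rotating traffic described below.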
The other stuff, the crawlers that use every User-Agent under the sun and a zillion datacenter and residential IPs, rotating their requests constantly so that all your naive, formerly-adequate rate-based blocking is useless... that stuff is definitely being tagged as "for AI" on the basis of circumstantial evidence. But given the timing of when it started and the sheer volume of traffic and addresses, I have no problem guessing with pretty high confidence that it's AI. As for "who are the customers": who has all the money in the world sloshing around at their fingertips and could use a whole bunch of scraped pages about ~everything? Call it lazy reasoning if you'd like.
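To make the "formerly ok" part concrete, here's a minimal Python sketch (with made-up thresholds) of the kind of per-IP sliding-window limiter that used to be good enough, and a demonstration of why rotating source addresses walks right through it:

```python
from collections import defaultdict, deque


class NaiveRateLimiter:
    """Per-IP sliding-window limiter: the kind of rate-based blocking
    that handled old-style abusive crawlers just fine."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now):
        q = self.hits[ip]
        # Drop hits that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True


limiter = NaiveRateLimiter(max_requests=10, window_seconds=60)

# One aggressive client hammering from a single IP gets cut off quickly...
single_ip = [limiter.allow("203.0.113.5", now=t) for t in range(30)]

# ...but the same 30 requests spread across 30 rotating addresses all
# pass, because each individual IP stays far under the per-IP threshold.
rotated = [limiter.allow(f"198.51.100.{i}", now=t)
           for i, t in enumerate(range(30))]
```

The limiter only ever sees one request per rotated address, so from its point of view nothing is abusive, which is exactly the problem.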
How much this ultimately traces back to the big familiar brand names vs. would-be upstarts, I don't know. But a lot of sites are blocking the crawlers that admit who they are, so would I be surprised to learn those companies are also paying shady subcontractors for scrapes and don't particularly care about the methods? Not really.