Hacker News

Animats 05/15/2025

Are the scraper sites using a large number of IP addresses, like a distributed denial of service attack? If not, rather than explicit blocking, consider using fair queuing. First serve all the requests from IP addresses that have zero requests pending, then those from IP addresses with one request pending, and so forth. Each IP address contends only with itself, so making massive numbers of requests from one address won't cause a problem for anyone else.

I put this on a web site once, and didn't notice for a month that someone was making queries at a frantic rate. It had zero impact on other traffic.
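A minimal sketch of the scheme described above, as one possible reading of it rather than the code that actually ran on that site. It assumes each incoming request is ranked by how many requests the same IP already has queued, and that lower ranks are served first. The `FairQueue` class and its `push`/`pop` methods are hypothetical names, and the addresses are documentation-range examples.

```python
import heapq
import itertools
from collections import defaultdict


class FairQueue:
    """Serve requests ordered by how many requests their IP already had queued."""

    def __init__(self):
        self._heap = []                    # entries: (rank, seq, ip, request)
        self._pending = defaultdict(int)   # ip -> number of requests currently queued
        self._seq = itertools.count()      # tie-breaker: FIFO within the same rank

    def push(self, ip, request):
        # Assumption: rank is fixed at arrival time and never recomputed.
        rank = self._pending[ip]           # 0 if this IP has nothing queued
        self._pending[ip] += 1
        heapq.heappush(self._heap, (rank, next(self._seq), ip, request))

    def pop(self):
        # Assumption: a request stops counting as pending once it is dequeued.
        _rank, _seq, ip, request = heapq.heappop(self._heap)
        self._pending[ip] -= 1
        if self._pending[ip] == 0:
            del self._pending[ip]          # drop idle IPs so the table stays small
        return ip, request

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    q = FairQueue()
    for i in range(5):                     # one address floods the queue
        q.push("203.0.113.9", f"scrape-{i}")
    q.push("198.51.100.7", "home-page")    # a light user arrives last
    while q:
        print(q.pop())
```

In this run the light user's request is served second, right after the first request from the busy address, instead of waiting behind all five. The remaining four requests from the busy address only delay each other.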


Replies

jashmatthews 05/15/2025

Exactly that. It's an arms race between companies that offer a large number of residential IPs as proxies and companies that run unauthenticated web services trying not to die from denial of service.

https://brightdata.com/

fellerts 05/15/2025

Huh, that sounds very reasonable, and it's the first time I've heard it mentioned. Why isn't this more widespread?

jclulow 05/15/2025

Yes, LLM-era scrapers are frequently making use of large numbers of IP addresses from all over the place. Some of them seem to be botnets, but based on IP subnet ownership it also seems pretty frequently to be cloud companies, many of them outside the US. In addition to fanning out to different IPs, many of the scrapers appear to use User-Agent strings that are randomised, or perhaps in some cases themselves generated by the slop factory. It's pretty fucking bleak out there, to be honest.
