Yeah, I agree with this. These types of roach motels have been around for decades and are at this p...

marginalia_nu • 01/16/2025 • 0 replies • view on HN

Yeah, I agree with this. These types of roach motels have been around for decades and are at this point well understood and not much of a problem for anyone. You basically need to be able to deal with them to do any sort of large scale crawling.

The reality of web crawling is that the web is already extremely adversarial and any crawler will get every imaginable nonsense thrown at it, ranging from various TCP tar pits, compression and XML bombs, really there's no end to what people will put online.

A more resource effective technique to block misbehaving crawlers is to have a hidden link on each page, to some path forbidden via robots.txt, randomly generated perhaps so they're always unique. When that link is fetched, the server immediately drops the connection and blocks the IP for some time period.

alt Hacker News