Hacker News

Mr_Bees69 yesterday at 2:55 PM

Please add a robots.txt; it's quite a d### move toward people who build responsible crawlers for fun.


Replies

marginalia_nu yesterday at 5:48 PM

It's a fairly trivial inconvenience. You can just add something to the effect of the code below, and you won't get stuck and realistically won't skip anything of value.

  // Slow response with a tiny payload: don't follow its links.
  if (response_time_seconds > 8 && response_payload_bytes < 2048) {
    extract_links = false;
  }
The odds that a payload smaller than the average <head> element takes 20 seconds to load and still contains something worth crawling are fairly low.
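
For anyone who wants something closer to runnable, here is a rough Python sketch of the same check. The 8 second and 2048 byte thresholds come from the snippet above; the names `should_extract_links` and `fetch_timed` are made up for illustration, and the fetch helper is just one assumed way to measure response time with the standard library, not how any particular crawler does it.

```python
import time
import urllib.request

SLOW_SECONDS = 8.0   # response-time threshold from the snippet above
TINY_BYTES = 2048    # payload-size threshold from the snippet above


def should_extract_links(response_time_s: float, payload_bytes: int) -> bool:
    """Return False when a response was both slow and tiny."""
    slow = response_time_s > SLOW_SECONDS
    tiny = payload_bytes < TINY_BYTES
    return not (slow and tiny)


def fetch_timed(url: str, timeout_s: float = 30.0):
    """Hypothetical helper: fetch a URL and return (body, elapsed seconds).

    Note: urlopen's timeout applies to each blocking socket operation,
    not to the whole download, so a server that drips bytes slowly can
    still take longer than timeout_s overall.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = resp.read()
    return body, time.monotonic() - start
```

A crawler loop would call `fetch_timed`, pass the elapsed time and `len(body)` to `should_extract_links`, and only parse out links when it returns True. Both conditions have to hold, so a slow but large page or a fast but small page is still crawled normally.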