Hacker News

fartfeatures · yesterday at 10:47 PM

> They won't believe a random site when it says "Look, stop hitting our API, you can pick all of this data in one go, over in this gzipped tar file."

What mechanism does a site have for doing that? I don't see anything in the robots.txt standard about being able to set priority, but I could be missing something.


Replies

arjie · yesterday at 11:30 PM

The only real mechanism is "Disallow: /rendered/pages/*" and "Allow: /archive/today.gz" or whatever, and there is no way to communicate that the latter contains the same content as the former. There is no machine-readable standard, AFAIK, that lets webmasters communicate with bot operators at this level of detail. It would be pretty cool if standard CMSes had such a protocol to adhere to. Install a plugin and people could 'crawl' your Wordpress or your Mediawiki from a single dump.
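
For concreteness, a minimal sketch of that robots.txt using the example paths above. Since there's no standard directive to say the archive contains the disallowed content, a comment is about the best you can do:

    User-agent: *
    # The content under /rendered/pages/ is also available as a single archive:
    # /archive/today.gz
    Disallow: /rendered/pages/*
    Allow: /archive/today.gz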

jacksnipe · yesterday at 10:50 PM

It’s not great, but you could add it to the body of a 429 response.
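
A sketch of what that could look like (the archive URL is just a placeholder, and Retry-After is the standard companion header for 429):

    HTTP/1.1 429 Too Many Requests
    Retry-After: 3600
    Content-Type: text/plain

    Rate limited. A full dump of this site is available at
    https://example.com/archive/today.gz - please fetch that instead of crawling.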

squigz · yesterday at 10:51 PM

The mechanism is putting some text that points to the downloads.
