Hacker News

arjie · yesterday at 11:30 PM

The only real mechanism is something like "Disallow: /rendered/pages/*" plus "Allow: /archive/today.gz", and there is no way to communicate that the latter is a substitute for the former. There is no machine-readable standard, AFAIK, that lets webmasters communicate to bot operators at this level of detail. It would be pretty cool if standard CMSes had such a protocol to adhere to: install a plugin and people could 'crawl' your WordPress from a single dump, or your MediaWiki from a single dump.
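To illustrate the gap, here is roughly what a webmaster can express today with robots.txt alone (the paths are the hypothetical ones from above):

```
# Keep crawlers off the individually rendered pages
User-agent: *
Disallow: /rendered/pages/
Allow: /archive/today.gz

# Note: nothing in this file can state that the .gz archive
# CONTAINS the disallowed pages. A bot either crawls what it
# is allowed to, or it doesn't; the equivalence is lost.
```

The allow/disallow vocabulary is the whole of the standard, so the "this dump replaces that tree" relationship has to be communicated out of band, if at all.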


Replies

sbarre · today at 1:07 AM

A sitemap.xml file could get you most of the way there.
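For example, a sitemap could at least advertise the dump's URL and freshness (URL and date here are hypothetical):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Advertise the single-file dump like any other URL -->
    <loc>https://example.com/archive/today.gz</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```

Though note the sitemap protocol, like robots.txt, has no field for saying "this URL substitutes for those pages" — it only lists URLs, so the semantic gap the parent describes remains.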