Hacker News

arjie · yesterday at 11:30 PM

The only real mechanism is something like "Disallow: /rendered/pages/*" plus "Allow: /archive/today.gz", and there is no way to communicate that the latter is a substitute for the former. There is no machine-readable standard, AFAIK, that lets webmasters communicate to bot operators at this level of detail. It would be pretty cool if standard CMSes had such a protocol to adhere to: install a plugin and people could 'crawl' your WordPress from a single dump, or your MediaWiki from a single dump.
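To illustrate the gap, here is roughly what a webmaster can express today with robots.txt alone (the paths are the hypothetical ones from above):

```
# Keep crawlers off the individually rendered pages
User-agent: *
Disallow: /rendered/pages/
Allow: /archive/today.gz

# Note: nothing in this file can state that the .gz archive
# CONTAINS the disallowed pages. A bot either crawls what it
# is allowed to, or it doesn't; the equivalence is lost.
```

The allow/disallow vocabulary is the whole of the standard, so the "this dump replaces that tree" relationship has to be communicated out of band, if at all.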


Replies

sbarre · today at 1:07 AM

A sitemap.xml file could get you most of the way there.
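For example, a sitemap could at least advertise the dump's URL and freshness (URL and date here are hypothetical):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Advertise the single-file dump like any other URL -->
    <loc>https://example.com/archive/today.gz</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```

Though note the sitemap protocol, like robots.txt, has no field for saying "this URL substitutes for those pages" — it only lists URLs, so the semantic gap the parent describes remains.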