logoalt Hacker News

arjieyesterday at 10:47 PM2 repliesview on HN

Someone convinced me last time[0] that these aren't the well-known scrapers we know but other actors. We wouldn't be able to tell, certainly. I'd like to help the scrapers be better about reading my site, but I get why they aren't.

I wish there were an established protocol for this. Say a $site/.well-known/machine-readable.json that instructs you on a handful of established software or allows pointing to an appropriate dump. I would gladly provide that for LLMs.

Of course this doesn't solve for the use-case where the AI companies are trying to train their models on how to navigate real world sites so I understand it doesn't solve all problems, but one of the things I think I'd like in the future is to have my own personal archive of the web as I know it (Internet Archive is too slow to browse and has very tight rate-limits) and I was surprised by how little protocol support there is for robots.

robots.txt is pretty sparse. You can disallow bots and this and that, but what I want to say is "you can get all this data from this git repo" or "here's a dump instead with how to recreate it". Essentially, cooperating with robots is currently under-specified. I understand why: almost all bots have no incentive to cooperate so webmasters do not attempt to. But it would be cool to be able to inform the robots appropriately.

To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.

0: https://news.ycombinator.com/item?id=46352723


Replies

saaaaaamyesterday at 10:54 PM

As referenced in the article, there absolutely is an alternative.

https://metabrainz.org/datasets

Linked to from the homepage as “datasets”.

I may be too broadly interpreting what you mean by “machine-communicable” in the context of AI scraping though.

show 1 reply
squigzyesterday at 10:54 PM

> To archive Metabrainz there is no way but to browse the pages slowly page-by-page. There's no machine-communicable way that suggests an alternative.

Why does there have to be a "machine-communicable way"? If these developers cared about such things they would spend 20 seconds looking at this page. It's literally one of the first links when you Google "metabrainz"

https://metabrainz.org/datasets

show 1 reply