Metabrainz is a great resource -- I wrote about them a few years ago here: https://www.eff.org/deeplinks/2021/06/organizing-public-inte...
There's something important here: a public good like Metabrainz would be fine with the AI bots picking up their content -- they're just doing it in a frustratingly inefficient way.
It's a co-ordination problem: Metabrainz assumes good intent from bots, and has to lock down when they violate that trust. The bots have a different model -- they assume that the website is adversarially "hiding" its content. They won't believe a random site when it says "Look, stop hitting our API, you can pick up all of this data in one go, over in this gzipped tar file."
Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
> They won't believe a random site when it says "Look, stop hitting our API, you can pick up all of this data in one go, over in this gzipped tar file."
What mechanism does a site have for doing that? I don't see anything in the robots.txt standard about being able to set priority, but I could be missing something.
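The closest thing I'm aware of is the Sitemap: directive, which is widely recognized but only points crawlers at a list of URLs, not at a bulk dump. Anything beyond that would just be an informal, non-standard hint that a crawler is free to ignore -- roughly a sketch like this, where the URLs and the dump comment are hypothetical, not part of any standard:

    User-agent: *
    Crawl-delay: 10        # non-standard; only some crawlers honor it
    Sitemap: https://example.org/sitemap.xml
    # Informal hint (no standard defines this): the full dataset is
    # available as a single archive at
    # https://example.org/dumps/latest.tar.gz -- please fetch that
    # instead of crawling the API.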
> The bots have a different model -- they assume that the website is adversarially "hiding" its content.
This should give us pause. If a bot considers this adversarial and refuses to respect the site owner's wishes, that's a big part of the problem.
A bot should not consider that “adversarial”.
> They won't believe a random site when it says "Look, stop hitting our API, you can pick up all of this data in one go, over in this gzipped tar file."
Is there a mechanism to indicate this? The "a" command in the Scorpion crawling policy file is meant for this purpose, but that is not for use with WWW. (The Scorpion crawling policy file also has several other commands that would be helpful, but they are also not for use with WWW.)
There is also the consideration of knowing at what interval the data that can be downloaded in this way will be archived; for data that changes often, you will not want to download it every time. This consideration also applies to torrents, since a new hash will be needed for each new version of the file.
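One way to handle both points (purely a sketch; this is not a format defined by any standard or used by any particular site) would be a small index file published next to the dumps, so a downloader can check the date and checksum before deciding whether to fetch again, and can pick up the new .torrent -- and therefore the new info-hash -- when a new version appears:

    # hypothetical dumps/LATEST index file
    dump:     data-dump-2025-06-01.tar.gz
    date:     2025-06-01
    sha256:   <checksum of the archive>
    torrent:  data-dump-2025-06-01.torrent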
> Or better still, this torrent file, where the bots would briefly end up improving the shareability of the data.
That is an amazing thought.
Yeah, AI scrapers are one of the reasons why I have closed my public website https://tvnfo.com and only left the donors' site online. It's not only because of AI scrapers, but I grew tired of people trying to scrape the site, eating a lot of resources this small project doesn't have. Very sad really; it had been publicly online since 2016. Now it's only available for donors. I'm running a tiny project on just $60 a month. If this were not my hobby I would have closed it completely a long time ago :-) Who knows, if there is more support in the future I might reopen the public site again with something like Anubis bot protection. But I thought it was only small sites like mine that get hit hard; it looks like many have similar issues. Soon nothing will be open or useful online. I wonder if this was the plan all along for whoever is pushing AI on a massive scale.