
tooltower yesterday at 11:18 PM

> Rather than downloading our dataset in one complete download, they insist on loading all of MusicBrainz one page at a time.

Is there a standard mechanism for batch-downloading a public site? I'm not too familiar with crawlers these days.


Replies

TeMPOraL yesterday at 11:27 PM

There isn't. There never was one, because the vast majority of websites are actually selfish with their data, even when that's entirely pointless. You can see this even here, in how some people complain that LLMs made them stop writing their blogs: it turns out plenty of people say they write for others to read, but they care more about tracking and controlling their audience.

Anyway, all that means there was never a critical mass of sites large enough for a default bulk-data-dump discovery mechanism to become established. So even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape page by page anyway.
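
Concretely, about the best a well-intentioned scraper can do today is probe a couple of conventions and then give up. A rough sketch in Python (the Sitemap lines in robots.txt are a real convention, though they index pages rather than dumps; the /.well-known/data-dumps path is invented, which is exactly the problem):

    import urllib.error
    import urllib.request
    import urllib.robotparser

    def find_bulk_hints(base_url: str) -> list[str]:
        # base_url is e.g. "https://example.org", no trailing slash.
        hints = []

        # Sitemap: lines in robots.txt are the closest thing to a
        # standard discovery mechanism that actually exists.
        rp = urllib.robotparser.RobotFileParser(base_url + "/robots.txt")
        try:
            rp.read()
            hints.extend(rp.site_maps() or [])
        except urllib.error.URLError:
            pass

        # Hypothetical well-known endpoint for bulk dumps; there is
        # no such standard, so in practice this 404s everywhere.
        probe = base_url + "/.well-known/data-dumps"
        try:
            with urllib.request.urlopen(probe, timeout=10) as resp:
                if resp.status == 200:
                    hints.append(probe)
        except urllib.error.URLError:
            pass

        # Empty list means: no choice but to crawl page by page.
        return hints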

EnderWT yesterday at 11:23 PM

They're not talking about downloading the web pages. The data is available in a bulk download: https://listenbrainz.org/data/
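
So instead of thousands of page fetches it's one streaming download, something like this sketch (the dump URL here is made up; the real files are listed at that page):

    import urllib.request

    # Hypothetical filename -- check https://listenbrainz.org/data/
    # for the actual dump listing.
    DUMP_URL = "https://example.org/dumps/listenbrainz-full-dump.tar.zst"

    def download_dump(url: str, dest: str) -> None:
        # Stream to disk in 1 MiB chunks so the multi-GB archive
        # never has to fit in memory.
        with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
            while chunk := resp.read(1 << 20):
                out.write(chunk)

    download_dump(DUMP_URL, "listenbrainz-dump.tar.zst")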