Hacker News

lep_qq · yesterday at 10:37 PM · 2 replies

This is frustrating to watch. MetaBrainz is exactly the kind of project AI companies should be supporting: open data, community-maintained, freely available for download in bulk. Instead they're:

- Ignoring robots.txt (the bare minimum of web courtesy)
- Bypassing the provided bulk download (literally designed for this use case)
- Scraping page-by-page (inefficient for everyone)
- Overloading volunteer-run infrastructure
- Forcing the project to add auth barriers that hurt legitimate users

The irony: if they'd just contacted MetaBrainz and said "hey, we'd like to use your dataset for training," they'd probably have gotten a bulk export and maybe even attribution. Instead, they're burning goodwill and forcing open projects to lock down.

This pattern is repeating everywhere. Small and medium open data projects can't afford the infrastructure to handle aggressive scraping, so they either:

1. Add authentication (reduces openness)
2. Rate limit aggressively (hurts legitimate API users)
3. Go offline entirely (the community loses the resource)

AI companies are externalizing their data acquisition costs onto volunteer projects. It's a tragedy of the commons, except the "commons" is deliberately maintained infrastructure that these companies could easily afford to support.

Have you considered publishing a list of the offending user agents / IP ranges? It might help other projects protect themselves, and public shaming sometimes works when technical measures don't.
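A published list would be easy to act on at the edge. As a minimal sketch (assuming nginx, and using a few crawler user agents that are publicly documented — GPTBot, ClaudeBot, CCBot; scrapers that spoof a generic browser user agent would still slip through):

```nginx
# Map known AI-crawler user agents to a flag (case-insensitive regex match).
map $http_user_agent $block_ai {
    default      0;
    ~*GPTBot     1;   # OpenAI's published crawler UA
    ~*ClaudeBot  1;   # Anthropic's published crawler UA
    ~*CCBot      1;   # Common Crawl's published crawler UA
}

server {
    listen 80;
    server_name example.org;  # hypothetical host

    # Reject flagged crawlers before they reach the application.
    if ($block_ai) {
        return 403;
    }
}
```

This only helps against crawlers that identify themselves honestly; for the ones that don't, a shared IP-range list would be the more useful artifact.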


Replies

tensegrist · yesterday at 10:38 PM

    Scraping page-by-page (inefficient for everyone)
You know what else is "(inefficient for everyone)"? Posting the output instead of the prompt.