To put this into perspective, What.CD [0] was widely considered to be the music library of Alexandria, unparalleled in both its high quality standard and its depth. What.CD had in the ballpark of a few million torrents when it got raided and shut down. Anna's rip of Spotify includes roughly 186 million unique records. Granted, the tail end is a mixed bag of bot music and whatnot, but the scale is staggering.
Truly amazing work. I couldn't help but be sad that the less popular songs are not currently stored, as those are definitely the ones most at risk of being lost forever.
If you like the goal and you have even a few hundred GB available on your server, consider "donating" some of that space to seeding the data (music or books). It's absolutely how we can fight the system, even if just a tiny bit. https://annas-archive.org/torrents
I just found out that https://annas-archive.li/ is masked by my German internet provider (SIM.de/Drillisch). I usually use a VPN but I had it switched off temporarily to watch Fallout (Prime Video won't let you watch through a VPN). Only when I switched Mullvad back on could I open the site.
I didn't know German providers do this.
This work is so critical.
Read an article that was published just 10 years ago, and witness the bit rot as most external links will 404, gone forever.
I think it's worth questioning the value of preserving -everything-, but it seems like if we can, we should.
We can finally search for playlists with a given song! A basic feature that Spotify is missing!
Incredible.
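For what it's worth, a "playlists containing this song" lookup is just an inverted index from track ID to playlists. A minimal sketch, assuming the playlist data can be loaded as a mapping from playlist name to track IDs (the sample data and the `build_track_index` name are made up for illustration, not taken from the dump):

```python
from collections import defaultdict


def build_track_index(playlists):
    """Map each track ID to the set of playlist names that contain it."""
    index = defaultdict(set)
    for name, track_ids in playlists.items():
        for track_id in track_ids:
            index[track_id].add(name)
    return index


# Hypothetical sample data; the real dump will have its own layout.
playlists = {
    "morning run": ["t1", "t2", "t3"],
    "focus": ["t2", "t4"],
    "road trip": ["t3", "t5"],
}

index = build_track_index(playlists)
print(sorted(index["t2"]))  # playlists that contain track t2
```

Building the index is one pass over the playlists; after that each lookup is a single dictionary access.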
> A while ago, we discovered a way to scrape Spotify at scale.
They won’t and shouldn’t divulge the details, but I imagine that would be a fun read!
I recall many interesting tracks that were very aggressively deleted from all platforms in sync. I wonder if I could find them in this archive.
There is contemporary lost media being created every day because of how we distribute things now. I think in some cases, the intent of the publisher was to literally destroy every copy of the information. I understand the legal arguments for this, but from a spiritual perspective, this is one of the most offensive things I can imagine. Intentionally destroying all copies of a creative work is simply evil. I don't care how you frame it.
Making media effectively lost is not much different in my mind. Is it available if it's sitting on a tape in an iron mountain bunker that no one will ever look at again?
I'd rather see them use AI to convert all the scanned scientific articles into proper PDF or other formats.
Also sort and classify the articles by binary size, vs page count, plot count, raster image count etc, in order to compress the outliers and detect when a raster image should have been a plot and convert it to vectorized images etc.
How compact can we get the collective human scientific corpus?
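The "sort by binary size vs page count" idea can be tried with very little code. A rough sketch, assuming you already have each file's size and page count; the `flag_bloated` name, the threshold and the sample numbers are all illustrative assumptions. It flags documents whose bytes-per-page sits far above the median, using the median absolute deviation as the yardstick:

```python
from statistics import median


def flag_bloated(docs, threshold=5.0):
    """Return names of documents whose bytes-per-page is far above the median.

    docs: iterable of (name, size_bytes, page_count) tuples.
    threshold: how many median absolute deviations above the median
               counts as an outlier (an assumed, tunable value).
    """
    ratios = {name: size / pages for name, size, pages in docs if pages > 0}
    if not ratios:
        return []
    med = median(ratios.values())
    mad = median(abs(r - med) for r in ratios.values())
    if mad == 0:
        return []  # all documents look alike; nothing to flag
    return [name for name, r in ratios.items() if (r - med) / mad > threshold]


# Hypothetical sample: four typeset PDFs and one raster-scanned one.
docs = [
    ("a.pdf", 400_000, 10),
    ("b.pdf", 550_000, 12),
    ("c.pdf", 300_000, 8),
    ("d.pdf", 480_000, 11),
    ("scan.pdf", 60_000_000, 10),
]
print(flag_bloated(docs))
```

The flagged files would then be the candidates for OCR, recompression or vectorizing, which is a separate and much harder step.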
This might be the perfect time to do archiving before the entire internet gets inundated by sub-par AI generated content.
This is something really important, especially in the days when music and film vanish from platforms one by one. I myself have three playlists with greyed out titles (titles are missing so there's no possibility for me to find out what was there).
That's why I divide music into what I want to keep forever, which I buy on CD, and dance music that I could live without one day.
Hmm. This is actually not really something I need, I think; but I consider Anna's Archive etc. about as important as the internet web archive. We need to preserve data, at least the important data, and also historic data: how the original websites looked. Creativity of past generations. Same for games and books.
Webpages may have emerged only ~30 years ago, but there are already many young people who are too young to have experienced the early ones. There is always a generational change; our generation has the opportunity to store more things.
Hmmm I don’t like this. There are sources for music with better quality out there and all this will do is paint them a bigger target for takedowns/prosecution. I am worried about losing their ebook library. Quoting from the announcement: “Generally speaking, music is already fairly well preserved.” They should have done this as a separate identity.
Not that we should, but it's technically feasible to have a music streaming server with the torrent as the backend, selectively downloading the relevant part of the torrent in response to on-demand streaming requests from the client.
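The core of that selective download is mapping a byte range of one file onto torrent pieces. A minimal sketch, assuming the classic BitTorrent v1 layout where all files are concatenated and cut into fixed-size pieces; the `pieces_for_range` helper and the numbers are illustrative, not taken from any client:

```python
def pieces_for_range(file_offset, start, end, piece_length):
    """Return the piece indices covering bytes [start, end) of one file.

    file_offset: where the file begins inside the torrent's concatenated
                 payload (assumes the BitTorrent v1 layout).
    """
    if end <= start:
        return range(0)
    first = (file_offset + start) // piece_length
    last = (file_offset + end - 1) // piece_length
    return range(first, last + 1)


# Hypothetical numbers: 1 MiB pieces, a track starting 2,500,000 bytes
# into the payload, and a request for its first 3,000,000 bytes.
needed = pieces_for_range(
    file_offset=2_500_000, start=0, end=3_000_000, piece_length=1024 * 1024
)
print(list(needed))
```

A streaming backend would then ask its torrent client to prioritize just those pieces and serve the bytes once they verify.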
The metadata alone is incredibly valuable for researchers. Having 186 million ISRCs catalogued with associated genre, tempo, and popularity data is a goldmine for music analysis that doesn't even require touching the audio files.
I've always found it interesting how streaming services have become the de facto music library of record, yet they can and do remove content at will. When Spotify pulled out of Russia, entire catalogs became inaccessible. Physical media and personal archives suddenly matter again in ways we thought were obsolete.
The copyright discussion is complex, but from a pure preservation standpoint, I'm glad someone is doing this work.
> Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it. And such files are often poorly seeded.
There are a ton of good bands with under 10k or even 1k monthly listeners.
Anna’s Archive has largely flown under the radar by focusing on books.
Even perceived involvement in music piracy puts a much bigger target on their back from far more aggressive actors (RIAA, major labels)
Since the article asks:
> We're curious about the peaks at whole minutes (particularly 2:00, 3:00, 4:00). If you know why this is, please let us know!
As a hobby video/audio editor, people will start with their track taking up a preset amount and fill up the time - even if it means having some dead space at the end.
The other alternative is algorithmically created music.
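Either explanation predicts an excess of tracks right at the minute mark, and that is easy to measure. A small sketch, assuming track lengths are available in milliseconds; the `whole_minute_share` name, the half-second tolerance and the sample values are my own assumptions:

```python
def whole_minute_share(durations_ms, tolerance_ms=500):
    """Fraction of tracks whose length is within tolerance of a whole minute."""
    if not durations_ms:
        return 0.0
    hits = 0
    for d in durations_ms:
        remainder = d % 60_000
        # Distance to the nearest whole minute, below or above.
        distance = min(remainder, 60_000 - remainder)
        if distance <= tolerance_ms:
            hits += 1
    return hits / len(durations_ms)


# Hypothetical durations in milliseconds.
sample = [120_000, 180_200, 201_500, 239_700, 95_000]
print(whole_minute_share(sample))
```

If lengths were spread evenly, a half-second tolerance on each side would catch about 1,000 out of every 60,000 ms, so roughly 1.7% of tracks. Anything well above that is the peak the article describes.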
Moral and legal discussion aside, this is technically very impressive. I also wouldn’t be surprised if this somehow kickstarts open source music generative AI from China.
This is some of the greatest news I've ever heard for the digital preservation community. Just so many projects over the years could have used resources like this. Thank you for contributing to humankind!
I wonder how deep the hole they're gonna put whoever runs this site into is gonna be?
Site is down for me. Archive link: https://archive.is/jf3HW
Is the music torrent not up yet? Only see the metadata one here: https://annas-archive.li/torrents/spotify
I have Spotify premium but the constant shuffle of content availability has meant I’ve started routinely archiving my liked songs to avoid any rug pull. Zspotify and co still work a charm.
Amazing! I wonder if the Every Noise At Once[1] site could be updated with the metadata from this?
> Music files (releasing in order of popularity)
Increasing or decreasing? IMHO increasing would make more sense, as the most popular music is already mirrored in countless other places. It's the rare stuff that is most in need of preservation.
I wonder how much of the content there is AI-generated. Honestly, even as someone who was initially skeptical, I've found some of it to be rather good --- not knowing that it was AI-generated at first. Now if they could only reverse-engineer the prompt and only store the model, that would be an extremely efficient form of "compression".
It seems that the metadata doesn't include the lyrics, probably because they are provided by Musixmatch. It would have been nice to have a database of lyrics linked to ISRCs. AFAIK Lrclib doesn't support downloading lyrics for a given ISRC.
Unrelated, but I just can't stop myself from saying that I absolutely hate Spotify even though I'm a paying customer. Fuck you Spotify. You were supposed to be a convenient way to discover and listen to music. Now you are only convenient for listening to music, and absolutely terrible for any recommendations. This is sad really. Spotify had good recommendations. It's absolutely in a position where it can provide good recommendations — it has both a vast music library and a vast amount of data on user preferences. And it chooses to push procedural/ai-generated slop instead to earn more money. I thought that maybe buying $SPOT stock will make me more at peace with its greed, but it didn't work. Spotify fucking deserves to crash and burn because it sees paying customers as idiots who might not notice they are fed garbage. Fuck you Spotify, fuck you.
wow. Blocked in Belgium.
Error HTTP 451 - Unavailable For Legal Reasons
This is incredible. I once assembled a collection of 100,000 tracks for research on exploration of large music libraries. Essentially vector search. I was limited in storage and processing power to a single machine.
If I were to do it today, I could get so much farther with hyperscaler products and this dataset.
This will be great to train AI on.
What an early christmas gift for humanity. Now, asking for a friend, what's the ideal setup for torrenting this? Mullvad / Tailscale?
Quoting from their page:
--------------
This is by far the largest music metadata database that is publicly available. For comparison, we have 256 million tracks, while others have 50-150 million. Our data is well-annotated: MusicBrainz has 5 million unique ISRCs, while our database has 186 million.
--------------
If they truly are on a mission to protect the world's information from disappearing, they should work with MusicBrainz to get this data on it.
Alternatively, it would be amazing if they built a MusicBrainz-like service around it.
In either case, to make the data truly useful, they'd need to solve the problem of how to match the metadata to a fingerprint used to identify the music tracks, assuming that data is not part of the metadata they collected.
I wonder if they'll explore other music services as well. As I understand it, Deezer, Qobuz, and Tidal can all get ripped easily enough. Although I'm not sure if they rate limit downloads past a certain point.
I'm a bit sad that they chose to focus on music rather than audiobooks. Creating an archive of audiobooks seems like it would be more aligned with their mission.
Attracting the ire of the music industry seems like a huge, unnecessary risk. I wish they had performed this as some kind of other entity to try to keep the ebook archive protected from the fallout. I fear this will not end well.
Can someone explain why C#/Db (major/minor) is the third most popular key? Very unexpected for me, since it's relatively more difficult to play.
Can this last?
I envision an army of lawyers and cyber security companies being prepared to unleash a scorched earth campaign that book publishers might want to be part of as well.
At the end it may take down more than just this publication but most others as well.
I just want to be able to back up my playlists. Maybe that's possible but last time I looked I could only find sites that wanted your login, not gonna happen.
Oh, just noticed my provider "Vodafone Germany" is blocking the domain annas-archive.li at the DNS level.
So nice! That's an excellent extract and looks useful for benchmarking Meilisearch. I'll probably spend my Christmas holidays importing the tracks, albums, and artists into Meilisearch, while my CEO builds a beautiful front-end for it. I'll probably replace [the current music search demo](https://music.meilisearch.com) we have with this much higher-quality dataset!
That would also be a good fit for [the new delta-encoded posting lists I am working on](https://github.com/meilisearch/meilisearch/pull/5985). Let's see how good it can get. My early benchmarks showed a 50% reduction in disk usage.
Merry Christmas!
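For anyone wondering what delta-encoded posting lists buy you, here is a generic illustration. It stores the gaps between sorted document IDs as variable-length integers. This is a textbook sketch under my own assumptions, not Meilisearch's implementation from the linked PR, and the function names are made up:

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_postings(doc_ids):
    """Delta-encode a sorted list of non-negative IDs as varints."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        if doc_id < prev:
            raise ValueError("doc_ids must be sorted and non-negative")
        out += encode_varint(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            ids.append(prev)
            current, shift = 0, 0
    return ids


ids = [1000, 1003, 1010, 5000, 5001]
encoded = encode_postings(ids)
print(len(encoded), decode_postings(encoded) == ids)
```

Small gaps fit in a single byte, so dense posting lists shrink a lot compared with fixed-width IDs. How much a real engine saves depends on its data and on-disk format.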
Just buy music DRM-free in the first place.
TIL Anna's Archive is blocked in Germany (by a rather obtrusive MitM, I might add). Get redirected to a "Copyright Clearing House" or something.
GREAT DAY
I hope someone builds an open API around this metadata. I'd love to have alternatives to the big player APIs.
I want to peek in that metadata collection to see if it could be used to identify the AI slop that's infecting Spotify.
If you could identify a track supposedly by artist X was actually AI slop not created by artist X, you could use that information to skip tracks on (web) music players, for example.
I wonder if Spotify will pursue any legal actions to take this archive or the site down!
Uh, cool, I guess? I want to applaud that, but, first off, unless you are OpenAI or Facebook, it is not exactly plausibly easy to participate in the festivities. Even if I had a spare 300 TB lying around, how the fuck do I download that?
But, more importantly, I cannot even say "good for you", because I don't actually think it is good for Anna's Archive. I wouldn't touch that thing, if I was them. Do we even have any solid alternatives for books, if Anna's Archive gets shot down, by the way? Don't recommend Amazon, please.
I am not enthused by this news. Let us entertain the possibility that similar institutions will eschew this catalog.
We need insane for culture to survive.
This is insane.
I definitely was not aware Spotify DRM had been cracked to enable downloading at scale like this.
The thing is, this doesn't even seem particularly useful for average consumers/listeners, since Spotify itself is so convenient, and trying to locate individual tracks in massive torrent files of presumably tens of thousands of tracks each sounds horrible.
But this does seem like it will be a godsend for researchers working on things like music classification and generation. The only thing is, you can't really publicly admit exactly what dataset you trained/tested on...?
Definitely wondering if this was in response to desire from AI researchers/companies who wanted this stuff. Or if the major record labels already license their entire catalogs for training purposes cheaply enough, so this really is just solely intended as a preservation effort?