It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.
I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.
The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5 year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.
It's insane to me that in 2008 a bunch of pervs decentralized storage and made hentai@home to host hentai comics. Yet here we are almost 20 years later and we haven't generalized this solution. Yes I'm aware of the privacy issues h@h has (as a hoster you're exposing your real IP and people reading comics are exposing their IP to you) but those can be solved with tunnels, the real value is the redundant storage.
The fact AI companies are stripping mining IA for content and not helping to be part of the solution is egregious.
I would like to be able to pull content out of the Wayback Machine with a proper API [1]. I'd even be willing to pay a combination of per-request and per-gigabyte fees to do it. But then I think about the Archive's special status as a non-profit library, and I'm not sure that offering paid API access (even just to cover costs) is compatible with the organization as it exists.
[1] It looks like this might exist at some level, e.g. https://github.com/hartator/wayback-machine-downloader, but I've been trying to use this for a couple of weeks and every day I try I get a HTTP 5xx error or "connection refused."
Is running an IPFS node and pinning the internet archive's collections a good way to do this?
> $25-30M per year is a lot for a non-profit
$25 million a year is not remotely a lot for a non-profit doing any kind of work at scale. Wikimedia's budget is about seven times that. My local Goodwill chapter has an annual budget greater than that.
Facilitating mirroring seems like it would open up another can of liability worms for the IA, as well as, potentially, for those mirroring it. For example, they recently lost an appeal of a major lawsuit brought by book publishers. And then there's the Wayback Machine itself; who knows what they've hoovered up from the public internet over the years? Would you be comfortable mirroring that?
> $25-30M per year is a lot for a non-profit
First, whether IA or any other large non-profit/charity. When you are in the double-digit/triple-digit multi-million bracket, you are no longer a non-profit/charity. You are in effect a business with a non-profit status.
Whether IA or any other large entity, when you get to that size, you don't benefit from the "oh they are a poor non-profit" mindset IMHO.
To be able to spend $25-30M a year, you clearly have to have a solid revenue stream both immediate and in the pipeline, that's Finances 101. Therefore you are in a privileged and enviable position that small non-profits can only dream of.
Second, I would be curious to know how much of that is of their own doing.
By that I mean, its sure cute to be located in the former Christian Science church on Funston Avenue in San Francisco’s Richmond District.
But they could most likely save a lot of money if they were located in a carrier-neutral facility.
For example, instead of paying for expensive external fiber lines (no doubt multiple, due to redundancy), they would have large amounts of capacity available through simple cross-connects.
Similar on energy. Are they benefiting from the same economies of scale that a carrier-neutral facility does ?
I am not saying the way they are doing it is wrong. I'm just genuinely curious to know what premium they are paying for doing it like they are.
I'd like a Public Broadcasting Service for the Internet but I'm afraid that money would just be pulled from actual PBS at this point to support it.
i dunno why i keep imagining something like ipfs could help something like the internet archive...
Don’t put any stock into the numbers in the article. They are mostly made up out of thin air.
Pick the items you want to mirror and seed them via their torrent file.
https://help.archive.org/help/archive-bittorrents/
https://github.com/jjjake/internetarchive
https://archive.org/services/docs/api/internetarchive/cli.ht...
u/stavros wrote a design doc for a system (codename "Elephant") that would scale this up: https://news.ycombinator.com/item?id=45559219
(no affiliation, I am just a rando; if you are a library, museum, or similar institution, ask IA to drop some racks at your colo for replication, and as always, don't forget to donate to IA when able to and be kind to their infrastructure)