Hacker News

Internet Archive's Storage

227 points · by zdw · last Tuesday at 5:16 PM · 65 comments

Comments

dr_dshiv · today at 8:54 AM

> Li correctly points out that the Archive's budget, in the range of $25-30M/year, is vastly lower than any comparable website: By owning its hardware, using the PetaBox high-density architecture, avoiding air conditioning costs, and using open-source software, the Archive achieves a storage cost efficiency that is orders of magnitude better than commercial cloud rates.

That’s impressive. Wikipedia spends $185m per year and the Seattle Public Library spends $102m. Maybe not exactly comparable, but $30m per year seems inexpensive for the memory of the world…
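
The quoted "orders of magnitude" claim is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, assuming illustrative drive and cloud list prices (none of these figures are from the article):

```python
# Rough illustration of the "orders of magnitude" claim: amortized cost of
# owned disks vs. list-price cloud object storage. Every figure here is an
# assumption for illustration, not a number from the article.
DRIVE_CAPACITY_TB = 20            # assumption: commodity high-capacity HDD
DRIVE_PRICE_USD = 300             # assumption: rough street price
DRIVE_LIFETIME_YEARS = 5          # assumption: amortization window
COPIES = 2                        # the Archive keeps at least two copies of items

CLOUD_PRICE_PER_GB_MONTH = 0.023  # assumption: standard-tier object storage list price

owned_per_tb_year = DRIVE_PRICE_USD / DRIVE_CAPACITY_TB / DRIVE_LIFETIME_YEARS * COPIES
cloud_per_tb_year = CLOUD_PRICE_PER_GB_MONTH * 1000 * 12

print(f"Owned media, amortized, {COPIES} copies: ~${owned_per_tb_year:.0f}/TB-year")
print(f"Cloud object storage, list price:       ~${cloud_per_tb_year:.0f}/TB-year")
print(f"Ratio: ~{cloud_per_tb_year / owned_per_tb_year:.0f}x, before counting "
      "power, servers, labor, and egress on either side")
```

Bandwidth and request charges on the cloud side would widen the gap further for a site that serves as much traffic as the Archive does.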

mrexroad · today at 8:24 AM

> This "waste heat" system is a closed loop of efficiency. The 60+ kilowatts of heat energy produced by a storage cluster is not a byproduct to be eliminated but a resource to be harvested.

Are there any other data centers harvesting waste heat for benefit?
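
For a sense of the scale being harvested, converting the quoted 60+ kW of continuous heat into everyday units (the space-heater wattage below is an assumption for comparison):

```python
# Convert the quoted "60+ kilowatts" of continuous waste heat into energy terms.
HEAT_KW = 60                      # figure quoted in the excerpt above
SPACE_HEATER_KW = 1.5             # assumption: a typical household space heater

kwh_per_day = HEAT_KW * 24
kwh_per_year = kwh_per_day * 365

print(f"{kwh_per_day:,.0f} kWh of heat per day, {kwh_per_year:,.0f} kWh per year")
print(f"Equivalent to ~{HEAT_KW / SPACE_HEATER_KW:.0f} space heaters running continuously")
```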

arjie · today at 6:14 AM

This is very cool. One thing I'm curious about is the software side of things and the details of the hardware. What filesystem and RAID layer (or lack thereof) deals with this optimally? Looking into it a little:

* power budget dominates everything: I have access to a lot of rack hardware from old connections, but I don't want to put the army of old stuff in my cabinet because it would blow my power budget for not that much performance compared to my 9755. What disks does the IA use? Any specific variety, or, like Backblaze, a wide variety?

* magnetic is bloody slow: I'm not the Internet Archive, so I'm just going to have a couple of machines with a few hundred TiB. I'm planning on making them all one big ZFS pool so I can deduplicate, but it seems like a single disk failure dooms me to a massive rebuild.

I'm sure I can work it out with a modern LLM, but maybe someone here has experience with actually running massive storage and the use-case where tomorrow's data is almost the same as today's - as is the case with the Internet Archive where tomorrow's copy of wiki.roshangeorge.dev will look, even at the block level, like yesterday's copy.

The last time I built with multi-petabyte datasets we were still using Hadoop on HDFS, haha!
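
For the "tomorrow's copy looks like yesterday's copy" case arjie describes, here is a minimal block-level dedup sketch in Python; the .warc file names and the 128 KiB block size are made-up assumptions. ZFS dedup does the same thing per record, with the practical caveat that the dedup table has to stay resident in RAM/ARC:

```python
# Minimal sketch of block-level deduplication for near-identical daily snapshots.
# The file names and the 128 KiB block size are illustrative assumptions.
import hashlib
from pathlib import Path

BLOCK_SIZE = 128 * 1024      # assumption: roughly a ZFS recordsize
store = {}                   # digest -> block bytes (stand-in for on-disk storage)

def ingest(path: Path) -> list[str]:
    """Split a file into fixed-size blocks, store only unseen blocks,
    and return the digest list that reconstructs the file."""
    recipe = []
    with path.open("rb") as f:
        while block := f.read(BLOCK_SIZE):
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)   # duplicate blocks cost nothing extra
            recipe.append(digest)
    return recipe

if __name__ == "__main__":
    # Hypothetical daily crawls of the same site.
    day1 = ingest(Path("crawl-2024-01-01.warc"))
    day2 = ingest(Path("crawl-2024-01-02.warc"))
    referenced = len(day1) + len(day2)
    stored = len(store)
    print(f"{referenced} blocks referenced, {stored} stored "
          f"({100 * (1 - stored / referenced):.1f}% deduplicated)")
```

A real pipeline would more likely use content-defined chunking, since with fixed blocks a single inserted byte shifts every later block boundary and defeats the dedup.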

arcade79 · today at 9:57 AM

While reading this kind of article, I'm always surprised by how small the described storage is. Microsoft released their paper on LRCs in 2012, Google patented a bunch in 2010, and Facebook talked about their systems in the 2010-2014 era too. Ceph started getting good erasure codes around 2016-2020.

Has any of the big ones released articles on their storage systems in the last 5-10 years?
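
For anyone skimming past the acronyms: the work arcade79 mentions is mostly about cutting the raw-storage overhead of fault tolerance. A quick sketch of the overhead arithmetic, with the (k, m) parameters chosen as commonly published examples rather than anything from the article:

```python
# Storage overhead of plain replication vs. a (k, m) erasure code.
# A (k, m) code splits data into k data chunks plus m parity chunks, survives
# any m chunk losses, and costs (k + m) / k raw bytes per logical byte.
def replication_overhead(copies: int) -> float:
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    return (k + m) / k

schemes = {
    "3x replication (tolerates 2 losses)": replication_overhead(3),
    "RS(6,3)   (tolerates 3 losses)": erasure_overhead(6, 3),
    "RS(10,4)  (tolerates 4 losses)": erasure_overhead(10, 4),
}

for name, overhead in schemes.items():
    print(f"{name}: {overhead:.2f}x raw storage per logical byte")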

ranger_danger · today at 4:35 AM

I was hoping an article about IA's storage would go into detail about how their storage currently works, what kinds of devices they use, how much they store, how quickly they add new data, the costs, etc., but this one seems to cover only quite old stats.

tylerchilds · today at 4:34 AM

Why’s Wendy’s Terracotta moved?

badlibrarian · today at 6:38 AM

No climate control. No backup power. And it's secured by a wireless camera sitting in a potted plant. Bless them, but wow.

JavohirXR · today at 11:01 AM

I saw the word "delve" and already knew it was redacted or written by AI.