Inside The Internet Archive's Infrastructure

456 points • by dvrp • 01/14/2026 • 119 comments

https://github.com/internetarchive/heritrix3

Comments

hedora • 01/15/2026

It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.

I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.

The density + power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5-year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.

BryantD • 01/15/2026

They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.

mcpar-land • 01/15/2026

Is this some kind of copypasted AI output? There are unformatted footnote numbers at the end of many sentences.

rarisma • 01/15/2026

I think this was written wholly by deep research.

It just reads like a clunky, low-quality article.

alfgrimur • 01/16/2026

I love to imagine this is all a cover and the Internet Archive is located in a remote cave in northern Sweden and consists of a series of endlessly self-replicating flash drives powered by the sun.

semiquaver • 01/15/2026

This article is way too LLMey for my taste.

bpiche • 01/15/2026

IA is hosting a couple more of Rick Prelinger’s shows this month. Looking forward to visiting.

brcmthrowaway • 01/15/2026

Does IA do deduplication?

ThinkBeat • 01/16/2026

Wow, that piece of real estate has to cost a bundle.

fedeb95 • 01/16/2026

Thanks for this, I've always wondered how the Archive operates but always ended up not searching.

initialg • 01/16/2026

Is it still year 2006 and websites haven’t figured out responsive design?

ghm2199 • 01/15/2026

Does anyone know how the size of this compares to archive.today?

vladiim • 01/15/2026

How long will it take for them to send the PetaBox to space?

lysace • 01/15/2026

The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.

bilater • 01/16/2026

I have always wondered how archives manage to capture screenshots of paywalled pages like the New York Times or the Wall Street Journal. Do they have agreements with publishers, do their crawlers have special privileges to bypass detection, or do they use technology so advanced that companies cannot detect them?

schmuckonwheels • 01/15/2026

Disappointed with the lack of pictures.

jarboot • 01/16/2026

Hate to be the guy in the comments complaining about the CSS, but the sides of the text of this article are cut off. It looks like I'm zoomed in, and there's no way I can see the first few columns of the text without going to Reader view. I'm on a modern iPhone using Safari, with the accessibility font size set larger than usual.

segalord • 01/16/2026

This is every data hoarder's dream setup, haha.

brcmthrowaway • 01/15/2026

[flagged]

krunck • 01/15/2026

[flagged]

cowhax • 01/15/2026

>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.

I'd say the nonprofit has found itself a profitable reason for its existence
