Hacker News

Inside The Internet Archive's Infrastructure

156 points by dvrp yesterday at 7:26 AM | 27 comments

https://github.com/internetarchive/heritrix3


Comments

hedora today at 7:31 PM

It's frustrating that there's no way for people to (selectively) mirror the Internet Archive. $25-30M per year is a lot for a non-profit, but it's nothing for government agencies, or private corporations building Gen AI models.

I suspect having a few different teams competing (for funding) to provide mirrors would rapidly reduce the hardware cost too.

The density and power dissipation numbers quoted are extremely poor compared to enterprise storage. Hardware costs for the enterprise systems are also well below AWS (even assuming a short 5-year depreciation cycle on the enterprise boxes). Neither this article nor the vendors publish enough pricing information to do a thorough total cost of ownership analysis, but I can imagine someone the size of IA would not be paying normal margins to their vendors.

BryantD today at 6:20 PM

They have come a very long way since the late 1990s when I was working there as a sysadmin and the data center was a couple of racks plus a tape robot in a back room of the Presidio office with an alarmingly slanted floor. The tape robot vendor had to come out and recalibrate the tape drives more often than I might have wanted.

mcpar-land today at 8:52 PM

Is this some kind of copypasted AI output? There are unformatted footnote numbers at the end of many sentences.

rarisma today at 9:46 PM

I think this was written wholly by deep research.

It just reads like a clunky, low-quality article.

cowhax today at 7:39 PM

>And the rising popularity of generative AI adds yet another unpredictable dimension to the future survival of the public domain archive.

I'd say the nonprofit has found itself a profitable reason for its existence

lysace today at 9:02 PM

The IA needs perhaps not just more money, but also more talented people, IMO. I worry that it has stagnated, from a tech pov.

brcmthrowaway today at 6:50 PM

Does IA do deduplication?

schmuckonwheels today at 8:12 PM

Disappointed with the lack of pictures.
