Hacker News

londons_explore 05/15/2025

Most of these unauthenticated requests are read-only.

All of public github is only 21TB. Can't they just host that on a dumb cache and let the bots crawl to their heart's content?


Replies

yorwba 05/15/2025

I guess you're getting the size from the Arctic Code Vault? https://github.blog/news-insights/company-news/github-archiv... That was 5 years ago and is presumably in git's compressed storage format. Caching the corresponding GitHub HTML would take significantly more.

TheDong 05/15/2025

You're talking about the 21TB captured to the arctic code vault, but that 21TB isn't "all of public github"

Quoting from https://archiveprogram.github.com/arctic-vault/

> every *active* public GitHub repository. [active meaning any] repo with any commits between [2019-11-13 and 2020-02-02 ...] The snapshot consists of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size

So no binaries larger than 100KB, no commit history, no issues or PR data, no other git metadata.

If we look at this blog post from 2022, the number we get is 18.6 PB for just git data https://github.blog/engineering/architecture-optimization/sc...

Admittedly, that includes private repositories too, and there's no public number for just public repositories, but I'm certain it's at least a noticeable fraction of that ~19PB.
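For a sense of scale, here's a rough back-of-envelope sketch in Python. The 21 TB and 18.6 PB figures are the ones quoted above; the public share is NOT known, so the percentages below are purely hypothetical placeholders, and `public_estimate_tb` is just an illustrative helper name. Decimal units (1 PB = 1000 TB) are assumed.

```python
# Figures quoted in the thread, in decimal terabytes (1 PB = 1000 TB assumed).
VAULT_TB = 21          # Arctic Code Vault snapshot
ALL_GIT_TB = 18_600    # 18.6 PB of git data (public + private) per the 2022 post


def public_estimate_tb(public_percent):
    """Size of public git data in TB for a guessed public share (in percent)."""
    return ALL_GIT_TB * public_percent / 100


# How many times larger all git data is than the vault snapshot.
ratio_all_to_vault = ALL_GIT_TB / VAULT_TB

if __name__ == "__main__":
    print(f"all git data vs. vault: ~{ratio_all_to_vault:.0f}x")
    # Hypothetical public shares; nobody outside GitHub knows the real one.
    for percent in (1, 10, 50):
        size = public_estimate_tb(percent)
        print(f"if {percent}% is public: {size:.0f} TB ({size / VAULT_TB:.1f}x the vault)")
```

Even if only 1% of that git data were public, it would still be several times the 21TB figure, and that's before counting issues, PRs, or rendered HTML.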
