Hacker News

catapart yesterday at 7:40 PM (2 replies)

Wow! That's a really great point of reference. I always knew text-based social media(ish) stuff should be "small", but I never had any idea if that meant a site like HN could store its content in 1-2 TB, or if it was more like a few hundred gigs or what. To learn that it's really only tens of gigs is very surprising!


Replies

ndriscoll yesterday at 7:52 PM

Scraped Reddit text archives (~23B items according to their corporate info page) are ~4 TB of compressed JSON, which includes metadata and not just the actual comment text.
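
For a sense of scale, those two figures can be turned into an average per-item size. This is only a back-of-the-envelope sketch: it takes the ~4 TB and ~23B numbers at face value and assumes decimal units (1 TB = 10**12 bytes); the variable names are just for illustration.

```python
# Rough average compressed size per item, using the figures quoted above.
# Assumption: decimal units, 1 TB = 10**12 bytes.
ARCHIVE_BYTES = 4 * 10**12  # ~4 TB of compressed JSON (text plus metadata)
ITEM_COUNT = 23 * 10**9     # ~23B items

bytes_per_item = ARCHIVE_BYTES / ITEM_COUNT
print(f"{bytes_per_item:.0f} bytes per item, compressed")  # prints "174 bytes per item, compressed"
```

So on average each item, metadata included, compresses down to well under a kilobyte.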

osigurdson yesterday at 7:46 PM

I suspect the text alone would be a lot smaller. Embeddings add a lot: 4 KB or more per item, regardless of the size of the text.
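
To illustrate why an embedding's size is independent of the text length: a dense embedding is a fixed-length vector, so its storage cost is just dimensions times bytes per value. The sketch below assumes float32 values and uses 1024 and 1536 dimensions purely as example sizes; the thread does not say which embedding model or dimension is involved, and `embedding_bytes` is a hypothetical helper.

```python
def embedding_bytes(dimensions: int, bytes_per_value: int = 4) -> int:
    """Storage for one dense embedding vector (float32 = 4 bytes per value)."""
    return dimensions * bytes_per_value

# Example dimensions only; the actual model used is not stated in the thread.
for dims in (1024, 1536):
    print(dims, "dims ->", embedding_bytes(dims), "bytes")

# The cost is the same for a two-word comment and a long essay.
short_comment = "ok".encode("utf-8")
long_comment = ("lorem ipsum " * 500).encode("utf-8")
print(len(short_comment), "bytes of text vs", embedding_bytes(1024), "bytes of embedding")
print(len(long_comment), "bytes of text vs", embedding_bytes(1024), "bytes of embedding")
```

Under these assumptions a 1024-dimension float32 vector is 4096 bytes, which would line up with a "4K or more" figure, and a short comment's embedding can be far larger than the comment itself.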