Hacker News

catapart today at 6:32 PM | 4 replies

Am I misunderstanding what a Parquet file is, or are all of the HN posts along with the embedding metadata a total of 55 GB?


Replies

gkbrk today at 7:15 PM

I imagine that's mostly embeddings actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.
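For a sense of scale, the two table sizes quoted above work out to roughly a 3x compression ratio. A quick sketch of the arithmetic, where the variable names are just for illustration and the figures are the ones from the comment:

```python
# Table sizes quoted in the comment, in GB.
uncompressed_gb = 17.68
compressed_gb = 5.67

# How many times smaller the compressed table is.
ratio = uncompressed_gb / compressed_gb

# Fraction of the original space saved by compression.
saved = 1 - compressed_gb / uncompressed_gb

print(f"compression ratio: {ratio:.2f}x")
print(f"space saved: {saved:.1%}")
```

Either way, the text alone is far below 55 GB, which fits the guess that embeddings make up most of the file.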

simlevesque today at 7:26 PM

You'd be surprised. I have a lot of text data, and Parquet files with Brotli compression can achieve impressively small file sizes.

Around 4 million web pages as Markdown come to something like 1-2 GB.
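To illustrate why text shrinks so much, here is a minimal sketch that uses `zlib` from the Python standard library as a stand-in for Brotli (Brotli itself is not in the standard library). The sample page is made up, and because it is simply repeated, it compresses far better than real web pages would, so this shows the general effect and not the figures above.

```python
import zlib

# Hypothetical sample: a small Markdown-like page, repeated many times.
page = (
    "# Title\n\n"
    "Some *markdown* text with [a link](https://example.com).\n\n"
    "- item one\n- item two\n"
)
raw = (page * 1000).encode("utf-8")

# zlib (DEFLATE) at maximum compression level; a stand-in for Brotli.
packed = zlib.compress(raw, level=9)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes")

# Compression is lossless: decompressing restores the original bytes.
restored = zlib.decompress(packed)
```

Parquet applies its codec per column chunk, so similar values stored together tend to compress well, but the exact ratio depends heavily on the data.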

verdverm today at 6:43 PM

based on the table they show, that would be my inclination

I've wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when they become relevant

lazide today at 7:04 PM

Compressed, pretty believable.