Am I misunderstanding what a parquet file is, or are all of the HN posts along with the embedding metadata a total of 55GB?
You'd be surprised. I have a lot of text data, and Parquet files with brotli compression can achieve impressive file sizes.
Around 4 million web pages as markdown come out to about 1-2 GB.
Based on the table they show, that would be my inclination.
I wanted to do this for my own upvotes so I can see the kinds of things I like, or find them again more easily when they're relevant.
Compressed, pretty believable.
I imagine that's mostly embeddings, actually. My database has all the posts and comments from Hacker News, and the table takes up 17.68 GB uncompressed and 5.67 GB compressed.