Hacker News

gkbrk, yesterday at 5:51 PM (3 replies)

My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?


Replies

0cf8612b2e1e, yesterday at 6:00 PM

Sorting, compression algorithm and level, and data types can all have an impact. I noted elsewhere that a Boolean is getting represented as an integer. That's one bit vs 1-4 bytes.

There is also flexibility in what you define as the dataset. Skinnier, more focused tables could save space compared with a wide table that covers everything, which will probably break up compressible runs of data.
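A rough illustration of the sorting and data-type points, as a minimal sketch only: it uses Python's standard-library zlib as a stand-in for whatever codec Parquet or ClickHouse is actually configured with, and the column contents, row count, and the helper name `compressed_size` are made up for the example.

```python
import random
import struct
import zlib


def compressed_size(values, fmt):
    """Pack integers with the given struct format and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    return len(zlib.compress(raw, 6))


random.seed(0)

# Hypothetical low-cardinality column: 100,000 rows drawn from 50 distinct values.
column = [random.randrange(50) for _ in range(100_000)]

# Same values, same 32-bit width; only the row order differs.
unsorted_size = compressed_size(column, "i")
sorted_size = compressed_size(sorted(column), "i")

# Hypothetical Boolean column: raw width as 32-bit integers vs. one bit per value.
flags = [random.random() < 0.5 for _ in range(100_000)]
raw_as_int32 = len(flags) * 4
raw_as_bits = (len(flags) + 7) // 8

print("unsorted int32 column, compressed:", unsorted_size, "bytes")
print("sorted int32 column, compressed:  ", sorted_size, "bytes")
print("booleans as int32, raw:", raw_as_int32, "bytes")
print("booleans as bits, raw: ", raw_as_bits, "bytes")
```

The sorted column turns into long runs of identical values, which a general-purpose compressor handles far better than the shuffled version. Real columnar formats add their own encodings (dictionary, run-length, delta) on top, so the absolute numbers here say nothing about Parquet or ClickHouse specifically.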

xnx, yesterday at 5:57 PM

Parquet has a few compression options. Not sure which one they are using.

boznz, yesterday at 9:20 PM

... and remove all the political shit-slop since COVID/AI and it's probably under a gig.
