logoalt Hacker News

slaulast Sunday at 9:48 PM3 repliesview on HN

A Parquet file compactor. I have a client whose data lakes are partitioned by date, and obviously they end up with thousands of files all containing single/dozens/thousands of rows.

I’d estimate 30-40% of their S3 bill could be eliminated just by properly compacting and sorting the data. I took it as an opportunity to learn DuckDB, and decided to build a tool that does this. I’ll release it tomorrow or Tuesday as FOSS.


Replies