A Parquet file compactor. I have a client whose data lakes are partitioned by date, and predictably they end up with thousands of files, each containing anywhere from a single row to a few thousand.
I’d estimate 30–40% of their S3 bill could be eliminated just by properly compacting and sorting the data. I took it as an opportunity to learn DuckDB and decided to build a tool that does this. I’ll release it tomorrow or Tuesday as FOSS.
Load the data into MergeTree instead? https://clickhouse.com/docs/engines/table-engines/mergetree-...
Published here: https://codeberg.org/unticks/comparqter