A Parquet file compactor. I have a client whose data lakes are partitioned by date, and predictably they end up with thousands of files, each containing anywhere from a single row to a few thousand.
I’d estimate 30–40% of their S3 bill could be eliminated just by properly compacting and sorting the data. I took it as an opportunity to learn DuckDB and decided to build a tool that does this. I’ll release it tomorrow or Tuesday as FOSS.
Load the data into MergeTree instead? https://clickhouse.com/docs/engines/table-engines/mergetree-...
Published here: https://codeberg.org/unticks/comparqter