
peheje yesterday at 8:42 AM

the data is dominated by big unique TEXT columns; unsure how that can compress much better when grouped - but would be interesting to know


Replies

3eb7988a1663 yesterday at 4:07 PM

I was thinking more of the numeric columns, which have pre-built compression mechanisms to handle incrementing columns or long runs of identical values. It's for sure less total data than the text, but my prior is that the two should perform equivalently on the text, so the better compression on the numbers should let duckdb pull ahead.
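You can actually watch those mechanisms kick in: DuckDB records which compression scheme it picked per column segment. A minimal sketch with the Python duckdb package (table and file names made up); on a checkpointed table, pragma_storage_info should show the incrementing column landing on bit-packing/delta-style encodings and the repeated value on RLE or constant encoding:

    import duckdb

    con = duckdb.connect("demo.duckdb")
    # one incrementing column plus one long run of identical values,
    # the two patterns DuckDB's lightweight compression targets
    con.execute("""
        CREATE TABLE t AS
        SELECT range AS id, 'constant' AS tag
        FROM range(1000000)
    """)
    con.execute("CHECKPOINT")  # flush to disk so segments get compressed
    # report the compression scheme chosen for each column's segments
    print(con.execute("""
        SELECT column_name, compression, count(*) AS segments
        FROM pragma_storage_info('t')
        GROUP BY ALL
    """).fetchall())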

I had to run a test for myself using sqlite2duckdb (no research, first search hit) on a randomly picked shard, 1636: the sqlite.gz was 4.9MB, but the duckdb.gz was 3.7MB.
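If anyone wants to reproduce without sqlite2duckdb (I haven't looked at what it does internally), DuckDB's own sqlite extension can handle the conversion. A rough sketch, with hypothetical file names for the shard:

    import gzip, os, shutil
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL sqlite")
    con.execute("LOAD sqlite")
    con.execute("ATTACH 'shard_1636.sqlite' AS src (TYPE sqlite)")
    con.execute("ATTACH 'shard_1636.duckdb' AS dst")
    con.execute("COPY FROM DATABASE src TO dst")  # copies all tables across
    con.close()  # checkpoint and flush before measuring

    # gzip both files and compare raw vs compressed sizes
    for path in ("shard_1636.sqlite", "shard_1636.duckdb"):
        with open(path, "rb") as f_in, gzip.open(path + ".gz", "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        print(path, os.path.getsize(path), os.path.getsize(path + ".gz"))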

The uncompressed sizes favor sqlite, which does not make sense to me, so I'm not sure if duckdb keeps around more statistics information. Uncompressed: sqlite 12.9MB, duckdb 15.5MB.
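One possible explanation: a DuckDB file grows in fixed-size blocks, so free or partially filled blocks count toward the raw file size while gzipping away to almost nothing. Something like this (same hypothetical file as above) would show the split:

    import duckdb

    con = duckdb.connect("shard_1636.duckdb")
    # one row with total/used/free block counts plus WAL size
    print(con.execute("PRAGMA database_size").fetchall())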