Just 10 minutes ago I was working with a very large, semi-malformed Excel file generated by a mainframe. DuckDB was able to load it with all_varchar (just keep everything as a string) in under a second.
I'm still waiting for Excel to load the file.
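A minimal sketch of that all_varchar trick, assuming the export is CSV-like (the file name here is made up):

    -- Read every column as VARCHAR so messy cells can't break type inference.
    -- 'mainframe_export.csv' is a hypothetical file name.
    SELECT *
    FROM read_csv('mainframe_export.csv', all_varchar = true);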
Anybody with experience using DuckDB to quickly select a page of filtered transactions from a single table with a couple of billion records and, say, 30 columns, where each column can be filtered with a simple WHERE clause? Let's say 10 years of payment order data. I'm wondering because this isn't an analytical scenario.
Doing that in Postgres takes some time, and even a simple count(*) takes a long time (with all columns indexed).
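Roughly the kind of query I mean (the table and column names are hypothetical): a few selective predicates plus one small page of ordered results:

    -- Hypothetical payments table: filter on a few columns, return one page.
    SELECT *
    FROM payment_orders
    WHERE order_date BETWEEN DATE '2023-01-01' AND DATE '2023-12-31'
      AND counterparty_id = 'ACME-42'
      AND status = 'SETTLED'
    ORDER BY order_date, order_id
    LIMIT 50 OFFSET 0;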
I'd say the author's thoughts are valid for basic data processing. Outside of that, most of the claims in this article, such as:
"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."
become very debatable. Depending on how you want to pivot/ scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
I use DuckDB for nearly everything data-related because it's simply the most scalable format/tool/lib, from a handful of rows to billions. It can handle it and I don't need to do anything special on top. I've been a big proponent of doing everything in SQL for a while now. It's the most recyclable AND least error-prone way of working with data.
I've not used DuckDB before, nor do I do much data analysis, so I'm curious about one aspect of processing medium-sized JSON/CSV with it: the data aren't indexed, so any non-trivial query requires a full scan. Is DuckDB so fast that this is never really a problem for most folks?
I think my favorite part of DuckDB is its flexibility. It's such a handy little Swiss Army knife for doing analytical processing in scientific environments (messy data w/ many formats).
Being able to use SQL on CSV and json/jsonl files is pretty sweet. Of course it does much more than that, but that's what I do most often with it. Love duckdb.
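A minimal example of what I mean, with made-up file and column names, joining a JSONL file against a CSV:

    -- read_json_auto infers the schema of newline-delimited JSON;
    -- the file and column names here are hypothetical.
    SELECT e.user_id, u.country, count(*) AS events
    FROM read_json_auto('events.jsonl') AS e
    JOIN read_csv_auto('users.csv') AS u USING (user_id)
    GROUP BY e.user_id, u.country
    ORDER BY events DESC;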
That's really interesting, I love the idea of being able to use columnar support directly within PostgreSQL.
I was thinking of using Citus for this, but DuckDB might be a better way to do it. Citus comes with a lot more out of the box, but DuckDB could be a good stepping stone.
"making DuckDB potentially a suitable replacement for lakehouse formats such as Iceberg or Delta lake for medium scale data" > I'm a Data Engineering noob, but DuckDB alone doesn't do metadata & catalog management, which is why they've also introduce DuckLake.
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up S3 + Iceberg + DuckDB for my company (a startup) and was wondering what to pick between Iceberg and DuckLake.
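For reference, the Iceberg-reading side in DuckDB looks roughly like this, assuming the httpfs and iceberg extensions (the bucket path is a placeholder, and credentials would come from your S3 config):

    -- Extensions for S3 access and Iceberg scans.
    INSTALL httpfs; LOAD httpfs;
    INSTALL iceberg; LOAD iceberg;

    -- Hypothetical warehouse path on S3.
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/payments');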
Agree with the author, will add: duckdb is an extremely compelling choice if you’re a developer and want to embed analytics in your app (which can also run in a web browser with wasm!)
I think this opens up a lot of interesting possibilities, such as more powerful analytics notebooks like marimo (https://marimo.io/) … and that's just one example of many.
DuckDB has experimental builds for Android ... I'm wondering how much work it would take to implement a Java API for it similar to SQLite's (Cursor, etc.).
What I love about duckdb:
-- Support for .parquet, .json, .csv (note: Spotify listening history comes as multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- If the files don't have the same schema, union_by_name is amazing (see the sketch after this list).
-- The .csv parser is amazing. It auto-assigns types well.
-- It's small! The WebAssembly version is 2 MB! The CLI is 16 MB.
-- Because it is small, you can add DuckDB directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to Power BI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
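A small example of the glob + union_by_name combo mentioned above (the file pattern follows the TSA example; the columns are whatever the files carry):

    -- Read every matching CSV as a single table, aligning columns by name
    -- even when some files have extra or missing columns.
    SELECT *
    FROM read_csv('tsa20*.csv', union_by_name = true);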