
tjchear today at 4:14 PM

I’ve not used DuckDB before, nor do I do much data analysis, so I am curious about one aspect of processing medium-sized JSON/CSV with it: the data are not indexed, so any non-trivial query would require a full scan. Is DuckDB so fast that this is never really a problem for most folks?


Replies

RobinL today at 5:20 PM

It is true that for JSON and CSV you need a full scan, but there are several mitigations.

The first is simply that it's fast: for example, DuckDB has one of the best CSV readers around, and it's parallelised.

Next, engines like DuckDB are optimised for aggregate analysis, where your single query processes a lot of rows (often a significant % of all rows). That means that a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database where often you need to quickly locate and update a single row out of millions.
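To make that concrete, here is a minimal sketch in Python (the file and column names are hypothetical, any CSV with similar columns would behave the same way) of the kind of single-pass aggregate query these engines are built for:

```python
# Minimal sketch: a full scan over a CSV, but doing exactly the kind of
# work an analytical engine is optimised for. File/column names are
# hypothetical.
import duckdb

con = duckdb.connect()  # in-memory database

# DuckDB's CSV reader infers types and parses the file in parallel; the
# aggregation runs in the same pass, so "scanning everything" is the
# expected workload rather than a penalty.
result = con.sql("""
    SELECT category, count(*) AS n, avg(amount) AS avg_amount
    FROM read_csv_auto('events.csv')
    GROUP BY category
    ORDER BY n DESC
""").df()
print(result)
```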

In addition, engines like DuckDB have predicate pushdown so if your data is stored in parquet format, then you do not need to scan every row because the parquet files themselves hold metadata about the values contained within the file.

Finally, formats like Parquet are columnar, so the engine only needs to scan the data in the columns your query touches, rather than processing whole rows when you may only be interested in one or two columns.
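A rough sketch of both points (predicate pushdown plus column pruning) against a Parquet file, again with hypothetical file and column names:

```python
# Sketch of predicate pushdown and column pruning on Parquet.
# 'events.parquet' and its columns are hypothetical.
import duckdb

# Only the 'amount' and 'event_date' columns are read from disk, and row
# groups whose min/max metadata rule out the date filter can be skipped
# entirely, so much of the file is never scanned.
total = duckdb.sql("""
    SELECT sum(amount)
    FROM 'events.parquet'
    WHERE event_date >= DATE '2024-01-01'
""").fetchone()[0]
print(total)
```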

ayhanfuat today at 5:18 PM

If you are going to query it frequently, then JSON/CSV might become an issue. I think the reason it doesn't become a problem for DuckDB/Polars users is that we generally convert the files to Parquet after the first read.
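The conversion is a one-liner; something like this (hypothetical file names) is the usual pattern:

```python
# One-off conversion sketch: parse the raw CSV once, keep it as Parquet
# for repeated querying. File names are hypothetical.
import duckdb

duckdb.sql("""
    COPY (SELECT * FROM read_csv_auto('events.csv'))
    TO 'events.parquet' (FORMAT PARQUET)
""")
# Subsequent queries hit the compressed, columnar Parquet file instead of
# re-parsing the CSV every time.
```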

biophysboy today at 4:24 PM

Zone maps are created for columns automatically. I process somewhat large tables with DuckDB regularly (100M rows) and never have any problems.
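As a sketch of what that buys you (table and column names are hypothetical): once the data lives in a native DuckDB table, each row group carries min/max statistics per column, so a range filter can skip row groups that cannot match without any explicit index.

```python
# Hypothetical example of zone maps helping a range filter.
import duckdb

con = duckdb.connect('analytics.duckdb')
con.sql("CREATE OR REPLACE TABLE events AS SELECT * FROM 'events.parquet'")

# The date filter is checked against per-row-group min/max metadata first;
# only row groups overlapping the range are actually scanned.
recent = con.sql("""
    SELECT count(*) FROM events
    WHERE event_date BETWEEN DATE '2024-06-01' AND DATE '2024-06-30'
""").fetchone()[0]
print(recent)
```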

simlevesque today at 4:48 PM

But when you load your JSON or CSV into a columnar format, if you have say 10 columns, each column is stored separately on disk instead of all together. So a scan over one column only needs to read a tenth of the disk space used by the data. Obviously this depends on the columns' content.

akhundelar today at 4:26 PM

Not a duckdb user, but I use polars a lot (mentioned in the article).

Depends on your definition of medium-sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near-instantly even on laptop CPUs.
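For a flavour of the Polars side, a minimal sketch with a hypothetical CSV and columns (API as in recent Polars versions): scan_csv builds a lazy plan, so the filter and aggregation are pushed down into the parallel scan rather than materialising the whole table first.

```python
# Lazy Polars sketch over a hypothetical CSV; filter and aggregation are
# applied during the scan itself.
import polars as pl

out = (
    pl.scan_csv("events.csv")
      .filter(pl.col("amount") > 100)
      .group_by("category")
      .agg(
          pl.col("amount").count().alias("n"),
          pl.col("amount").mean().alias("avg_amount"),
      )
      .collect()
)
print(out)
```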

mpalmer today at 4:28 PM

I guess the question is: how much is medium? DuckDB can handle quite a lot of data without breaking a sweat. Certainly if you prefer writing SQL for certain things, it's a no-brainer.