logoalt Hacker News

bravurayesterday at 1:00 PM2 repliesview on HN

Can you explain your use-case and why DuckDB is better?

Considering switching from pandas and want to understand what is my best bet. I am just processing feature vectors that are too large for memory, and need an initial simple JOIN to aggregate them.


Replies

rapatel0yesterday at 4:30 PM

Look into [Ibis](https://ibis-project.org/). It's a dataframe library built on duckdb. It supports lazy execution, greater than memory datastructures, remote s3 data and is insanely fast. Also works with basically any backend (postgres, mysql, parquet/csv files, etc) though there are some implementation gaps in places.

I previously had a pandas+sklearn transformation stack that would take up to 8 hours. Converted it to ibis and it executes in about 4 minutes now and doesn't fill up RAM.

It's not a perfect apples to apples pandas replacement but really a nice layer on top of sql. after learning it, I'm almost as fast as I was on pandas with expressions.

show 1 reply
sega_saiyesterday at 1:37 PM

I am not necessarily saying duckdb is better. I personally just found it easier, clearer to write a sql query for any complicated set of joins/group by processing than to try to do that in pandas.