
RobinL today at 5:32 PM

Author here. Re: 'SQL should be the first option considered': there are certainly advantages to other dataframe APIs like pandas or Polars, and arguably any one of them is nicer to work with in the moment than SQL. Right now Polars is ascendant, and it's a high-quality API.

But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.
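As a rough illustration (hypothetical column names, and assuming a recent Polars version), even a trivial aggregation has to be rewritten when you switch APIs:

    import pandas as pd
    import polars as pl

    # pandas: total amount per customer
    pdf = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 3.0]})
    out_pd = pdf.groupby("customer_id", as_index=False)["amount"].sum()

    # the same thing in Polars uses different method names and idioms
    pldf = pl.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 5.0, 3.0]})
    out_pl = pldf.group_by("customer_id").agg(pl.col("amount").sum())

Multiply that by every join, window function and reshape in a real pipeline and the migration cost adds up.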

I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware of how quickly you can process multiple GB of data on a laptop nowadays.
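To make that concrete, here's the kind of thing I mean (hypothetical file and column names, using the Python duckdb package): a laptop will typically chew through a few GB of Parquet like this in seconds:

    import duckdb

    # scan a few GB of Parquet straight off disk; DuckDB streams it,
    # so the whole dataset never has to fit in memory at once
    duckdb.sql("""
        SELECT customer_id, sum(amount) AS total
        FROM read_parquet('orders_*.parquet')
        GROUP BY customer_id
        ORDER BY total DESC
    """).show()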

I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.


Replies

noo_u today at 6:04 PM

I think both of us are ultimately wary of using the wrong tool for the job.

I see your point, even though my experience has been somewhat the opposite. E.g. a pipeline that used to work fast enough (or at all) up until some point in time, because the scale of the data or the requirements allowed it. Then some subset of those conditions changes, the pipeline can no longer meet them, and one has to reverse engineer obscure SQL views/stored procedures/plugins and migrate the whole thing to Python or some compiled language.

I work with high density signal data now, and my SQL knowledge occupies the "temporary solution" part of my brain for the most part.

chaps today at 6:02 PM

Yeah, my experiences match yours and I very, very much work with messy data (FOIA data), though I use postgres instead of duckdb.

Most of the datasets I work with are indeed <10GB, but the ones that are much larger follow the same ETL and analysis flows. It helps that I've built a lot of tooling to help with types and memory-efficient inserts. Having to rewrite pipelines because of "that one dataframe API" is exactly what solidified my thoughts around SQL over everything else. So much of my time has been lost trying to get dataframe and non-dataframe libraries to work together.
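For what it's worth, the "memory-efficient inserts" bit mostly boils down to streaming COPY rather than row-by-row INSERTs. A minimal sketch (hypothetical table and file names, using psycopg2):

    import psycopg2

    conn = psycopg2.connect("dbname=foia")
    with conn, conn.cursor() as cur, open("requests.csv") as f:
        # COPY streams the file into Postgres without building
        # a giant in-memory list of rows first
        cur.copy_expert(
            "COPY requests (id, agency, received_date) FROM STDIN WITH (FORMAT csv, HEADER)",
            f,
        )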

Thing about SQL is that it can be taken just about anywhere, so the time spent improving your SQL skills is almost always well worth it. R and pandas much less so.
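To illustrate the portability point (hypothetical table name, assuming the duckdb and psycopg2 packages): the same query text runs unchanged against very different engines:

    import duckdb

    QUERY = """
        SELECT agency, count(*) AS n
        FROM requests
        GROUP BY agency
        ORDER BY n DESC
    """

    # locally, against a file-backed DuckDB database
    duckdb.connect("foia.duckdb").sql(QUERY).show()

    # the identical string works against Postgres, e.g.
    # with psycopg2.connect("dbname=foia") as conn:
    #     cur = conn.cursor()
    #     cur.execute(QUERY)
    #     rows = cur.fetchall()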
