It's a bit sad for me: the biggest issue I have with pandas is the API, not the speed.
So many footguns, poorly-thought-through functions, tens of keyword arguments instead of good abstractions, and 1-D and 2-D structures being totally different objects (with no higher-order structures). I'd take 50% of the speed for a better API.
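A classic example of the kind of footgun being described (my own illustration):

```
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing may assign into a temporary copy and silently do nothing:
df[df["a"] > 1]["b"] = 0       # SettingWithCopyWarning; df is often unchanged

# The spelling that actually works:
df.loc[df["a"] > 1, "b"] = 0
```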
I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).
To be clear, this library might be great, it's just a shame for me that there seems no effort to make a Pandas-like thing with better API. Maybe time to roll up my sleeves...
> Then came along Polars (written in Rust, btw!) which shook the ground of Python ecosystem due to its speed and efficiency
Polars rocked my world by having a sane API, not by being fast. I can see the value in this approach if, like the author, you have a large amount of pandas code you don't want to rewrite, but personally I'm extremely glad to be leaving the pandas API behind.
Unfortunately it is not open source yet - https://github.com/fireducks-dev/fireducks/issues/22
This presentation does a good job distilling why FireDucks is so fast:
https://fireducks-dev.github.io/files/20241003_PyConZA.pdf
The main reasons are
* multithreading
* rewriting base pandas functions like `dropna` in C++
* a built-in compiler that removes unused code
Pretty impressive, especially given that you just `import fireducks.pandas as pd` instead of `import pandas as pd` and you are good to go.
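In other words, something like this is supposed to run unchanged (file and column names are placeholders):

```
import fireducks.pandas as pd   # the only line that changes

df = pd.read_csv("data.csv")               # placeholder path
result = df.dropna().groupby("key").sum()  # regular pandas API
```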
However, I think that if you are using a pandas function that wasn't rewritten, you might not see the speedups.
In essence, it is a commercial product with a free trial.
> Future Plans By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
It'll be Polars and DataFusion for me, thanks.
Lots of Pandas hate in this thread. However, for folks with lots of lines of Pandas in production, FireDucks can be a lifesaver.
I've had the chance to play with it on some of my code; queries that ran in 8+ minutes came down to 20 seconds.
Rewriting in Polars involves more code changes.
However, with Pandas 2.2+ and Arrow, you can use `.pipe` to move data to Polars, run the slow computation there, and then zero-copy back to Pandas. Like so, going from:

```
(df
  # slow part
  .groupby(...)
  .agg(...)
)
```

to:

```
import polars as pl

def polars_agg(df):
    return (pl.from_pandas(df)
              .group_by(...)
              .agg(...)
              .to_pandas())

(df
  .pipe(polars_agg)
)
```
> FireDucks is released on pypi.org under the 3-Clause BSD License (the Modified BSD License).
Where can I find the code? I don't see it on GitHub.
So it's from NEC (a major Japanese computer company), presumably a research artifact?
> https://fireducks-dev.github.io/docs/about-us/ Looks like so.
I’m sad that R’s tidy syntax is not copied more widely in the Python world. dplyr is incredibly intuitive; most people never bother reading the instructions, because you can look at a handful of examples and you’ve got the gist of it. Polars, despite its speed, is still verbose and inconsistent, while pandas is seemingly a collection of random spells.
Setting aside complaints about the Pandas API, it's frustrating that we might see the community of a popular "standard" tool fragment into two or even three ecosystems (for libraries with slightly incompatible APIs) -- seemingly all with the value proposition of "making it faster". Based on the machine learning experience over the last decade, this kind of churn in tooling is somewhat exhausting.
I wonder how much of this is fundamental to the common approach of writing libraries in Python with the processing-heavy parts delegated to C/C++ -- that the expressive parts cannot be fast and the fast parts cannot be expressive. I also wonder whether Rust (for Polars, and for other newer-generation libraries) changes this tradeoff substantially enough.
> 100% compatibility with existing Pandas code: check.
Is it actually? Do people see that level of compatibility in practice?
I understand `pandas` is widely used in finance and quantitative trading, but it does not seem to be the best fit, especially when you want your research code to be quickly ported to production.
We found `numpy` and `jax` to be a good trade-off between "too high-level to optimize" and "too low-level to understand", so in our hedge fund we just build data structures and helper functions on top of them. The downside of this combination is sparse data, for which we call wrapped C++/Rust code from Python.
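As a flavor of what such a helper layer can look like, here is a hypothetical sketch (not their actual code):

```
import numpy as np

def rolling_zscore(x: np.ndarray, window: int) -> np.ndarray:
    """Z-score of the newest element of each trailing window (hypothetical helper)."""
    # Strided view of shape (n - window + 1, window); no data is copied.
    v = np.lib.stride_tricks.sliding_window_view(x, window)
    return (v[:, -1] - v.mean(axis=1)) / v.std(axis=1)
```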
Many of the complaints about Pandas here (and around the internet) are about the weird API. However, if you follow a few best practices, you never run into the issues folks are complaining about.
I wrote a nice article about chaining for Ponder (sadly, it looks like the Snowflake acquisition has removed it). My book, Effective Pandas 2, goes deep into my best practices.
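For readers unfamiliar with the chaining style in question, a minimal sketch (file and column names invented):

```
import pandas as pd

result = (
    pd.read_csv("sales.csv")                             # hypothetical input
      .assign(date=lambda d: pd.to_datetime(d["date"]))  # derive columns inline
      .query("amount > 0")                               # filter without temp variables
      .groupby("region")
      .agg(total=("amount", "sum"))
      .reset_index()
)
```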
If they could just make a dplyr for Python, it would be so awesome. But sadly, I don’t think Python’s language semantics will support such a tool. It all comes down to managing the namespace, I guess.
Any explanation of what makes it faster than pandas and polars would be nice (at least something more concrete than "leverage the C engine").
My easy guess is that, compared to pandas, it's multi-threaded by default, which makes for an easy perf win. But even then, 130-200x feels extreme for a simple sum/mean benchmark. I see they are also doing lazy evaluation and some MLIR/LLVM-based JIT work, which is probably enough to get an edge over Polars; though its wins over DuckDB _and_ ClickHouse are also surprising, coming out of nowhere.
Also, I thought one of the reasons for Polars's API was that the Pandas API is much harder to retrofit lazy evaluation onto, so I'm curious how they did that.
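For context, this is roughly what lazy evaluation buys in Polars (shown for contrast; FireDucks' internals aren't public):

```
import polars as pl

plan = (
    pl.scan_csv("big.csv")      # builds a query plan; reads nothing yet
      .filter(pl.col("x") > 0)
      .select(["x", "y"])       # the optimizer can skip unused columns entirely
)
df = plan.collect()             # the optimized plan executes once
```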
Linux only right now https://github.com/fireducks-dev/fireducks/issues/27
Lots of people have mentioned Polars' sane API as the main reason to favor it, but the other crucial reason for us is that it's based on Apache Arrow. That allows us to use it where it's the best tool and then switch to whatever else we need when it isn't.
The killer app for Polars in my day-to-day work is its direct Parquet export. It's become indispensable for cleaning up stuff that goes into Spark or similar engines.
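In practice that workflow is only a couple of lines (paths hypothetical):

```
import polars as pl

df = pl.read_csv("raw_events.csv")
df = df.drop_nulls()                      # whatever cleanup is needed
df.write_parquet("clean_events.parquet")  # handed straight to Spark and friends
```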
Every time I see a new "better pandas", I check to see if it has geopandas compatibility.
Great work, but I will hold off on adoption until the C++ source is available.
Regarding compatibility, FireDucks appears to be using the same column dtypes:
```
>>> df['year'].dtype == np.dtype('int32')
True
```
How does it compare to Polars?
EDIT: I've found some benchmarks https://fireducks-dev.github.io/docs/benchmarks/
Would be nice to know what the internals of FireDucks are.
Just because I haven't jumped into the data ecosystem for a while - is Polars basically the same as Pandas but accelerated? Is Wes still involved in either?
Anyone here tried using FireDucks?
The promise of a 100x speedup with 0 changes to your codebase is pretty huge, but even a few correctness / incompatibility issues would probably make it a no-go for a bunch of potential users.
The biggest advantage of pandas is its extensibility. If you care about that, it’s (relatively) easy to add your own extension array type.
I haven’t seen that in other systems like Polars, but maybe I’m wrong.
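A full custom array type means subclassing `pandas.api.extensions.ExtensionArray` and `ExtensionDtype`, which is too long to show here; the lighter-weight accessor hook below gives a taste of the same extensibility point (a sketch, names mine):

```
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("geo")  # accessor name is arbitrary
class GeoAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    def bounding_box(self):
        """Min/max of hypothetical lat/lon columns."""
        d = self._df
        return (d["lat"].min(), d["lon"].min(), d["lat"].max(), d["lon"].max())

df = pd.DataFrame({"lat": [35.6, 35.7], "lon": [139.6, 139.8]})
print(df.geo.bounding_box())  # (35.6, 139.6, 35.7, 139.8)
```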
I had never heard of FireDucks! I'm curious if anyone else here has used it. Polars is nice, but it's not totally compatible. It would be interesting to see how much faster it is for more complex calculations.
TIL that NEC still exists. Now there’s a name I have not heard in a long, long time.
surprised not to see any mention of numpy (our go-to) here
edit: I know pandas uses numpy under the hood, but "raw" numpy is typically faster (and more flexible), so curious as to why it's not mentioned
Looks very cool, BUT: it's closed source? That's an immediate deal breaker for me as a quant. I'm happy to pay for my tools, but not being able to look and modify the source code of a crucial library like this makes it a non-starter.
Reading all pandas vs polars reminded me of the tidyverse vs data.table discussion some 10 years ago.
On average only 1.5x faster than Polars. That’s kinda crazy.
Very impressive, the Python ecosystem is slowly getting very good.
Pretty cool, but where's the source at?
Shouldn't that be FirePandas then?
FireDucks FAQ:
Q: Why do ducks have big flat feet?
A: So they can stomp out forest fires.
Q: Why do elephants have big flat feet?
A: So they can stomp out flaming ducks.
"FireDucks: Pandas but Faster" sounds like it's about something much more interesting than a Python library. I'd like to read that article.
Sure, but that's single-node performance. This makes it not very useful IMO, since quite a few data science folks work with Hadoop clusters, Snowflake clusters, or Databricks, where data is distributed and querying is handled by Spark executors.
Don't use it:
> By providing the beta version of FireDucks free of charge and enabling data scientists to actually use it, NEC will work to improve its functionality while verifying its effectiveness, with the aim of commercializing it within FY2024.
In other words, it's free only to trap you.