Hacker News

rich_sasha · yesterday at 11:56 AM · 22 replies

It's a bit sad for me. The biggest issue I have with pandas is the API, not the speed.

So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great; it's just a shame for me that there seems to be no effort to make a pandas-like thing with a better API. Maybe time to roll up my sleeves...


Replies

stared · yesterday at 3:03 PM

Yes, every time I write df[df.sth == val], a tiny part of me dies.
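For concreteness, a minimal sketch of the mask-based filtering being complained about (column and value names invented here), next to pandas' string-based `query` alternative:

```python
import pandas as pd

df = pd.DataFrame({"sth": ["a", "b", "a"], "x": [1, 2, 3]})

# Boolean-mask filtering: note the repetition of `df` inside its own brackets.
masked = df[df["sth"] == "a"]

# The string-based alternative avoids the repetition, at the cost of
# putting the condition in a string with no static checking.
queried = df.query("sth == 'a'")
```

Both produce the same rows; neither form is checked before runtime, which is part of the complaint.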

For comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R is cleaner than Python, that tells you a lot (as a side note: the same story for ggplot2 and matplotlib).

Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
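A hand-rolled column-schema check is one way to approximate the missing enforcement; this sketch is illustrative only (the schema, column names, and `validate` helper are invented, not a pandas or Pydantic API):

```python
import pandas as pd

# Illustrative schema: column name -> expected dtype string.
SCHEMA = {"col1": "int64", "col2": "object"}

def validate(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Raise if a column is missing or has the wrong dtype; checks run at runtime."""
    for col, dtype in schema.items():
        if col not in df.columns:
            raise KeyError(f"missing column: {col}")
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    return df

df = validate(pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]}), SCHEMA)
```

Libraries like pandera exist for exactly this niche, but nothing of the sort ships with pandas itself.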

I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.

movpasd · yesterday at 2:18 PM

I started using Polars for the "rapid iteration" use case you describe, in notebooks and such, and haven't looked back. There are a few ergonomic wrinkles that I mostly attribute to the newness of the library, but I found that Polars forces me to structure my thought process and ask myself "what am I actually trying to do here?".

I find I basically never write myself into a corner with initially expedient but ultimately awkward data structures like I often did with pandas, the expression API makes the semantics a lot clearer, and I don't have to "guess" the API nearly as much.

So even for this use case, I would recommend that anyone reading this try out Polars and see how it feels after the initial learning phase is over.

egecant · yesterday at 2:12 PM

Completely agree. From the perspective of someone who primarily uses R/tidyverse for data wrangling, there is a great article on why the pandas API feels clunky: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-...

ljosifov · yesterday at 12:06 PM

+1, seconding this. My limited experience with pandas had a non-trivial number of "?? Is it really like this? Nah, I'm mistaken for sure; this can't be, no one would do something insane like that" moments. And yet, and yet... FWIW, I've since found that numpy is a must (ofc), but pandas is mostly optional. So I stick to numpy for my own writing, and keep pandas read-only (just executing someone else's).

paddy_m · yesterday at 2:49 PM

Have you tried Polars? It's a much more regular syntax. The regular syntax fits well with the lazy execution, it's very composable for programmatically building queries, and then it's super fast.

martinsmit · yesterday at 12:01 PM

Check out redframes[1] which provides a dplyr-like syntax and is fully interoperable with pandas.

[1]: https://github.com/maxhumber/redframes

amelius · yesterday at 12:35 PM

Yes. Pandas turns 10x developers into .1x developers.

faizshah · yesterday at 1:18 PM

Pandas is a commonly known DSL at this point, so lots of data scientists know pandas like the back of their hand, and that's why a lot of "pandas, but for X" libraries have become popular.

I agree that pandas does not have the best-designed API compared to, say, dplyr, but it also has a lot of functionality, like pivot, melt, and unstack, that other libraries often don't implement. It's also existed for more than a decade at this point, so there's a plethora of resources and Stack Overflow questions.
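For anyone unfamiliar with the reshaping functions mentioned, a small round-trip sketch (data invented): `pivot` goes long-to-wide, `melt` goes back.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "city": ["NY", "LA", "NY", "LA"],
    "temp": [0, 15, 2, 16],
})

# Long to wide: one row per date, one column per city.
wide = df.pivot(index="date", columns="city", values="temp")

# Wide back to long: one (date, city, temp) row per cell.
long = wide.reset_index().melt(id_vars="date", value_name="temp")
```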

On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.

But I think if you stick with pandas, after a while you just learn all the weird quirks, and you gain huge benefits from all the things it can do and all the other libraries you can use with it.

sega_sai · yesterday at 12:26 PM

Great point, and one I completely share. I tend to avoid pandas at all costs except for very simple things, as I have been bitten by many issues related to indexing. For anything complicated I tend to switch to DuckDB instead.

omnicognate · yesterday at 12:07 PM

What about the polars API doesn't work well for your use case?

h14h · yesterday at 2:43 PM

If you wanna try a different API, take a look at Elixir Explorer:

https://hexdocs.pm/explorer/exploring_explorer.html

It runs on top of Polars, so you get those speed gains, but uses the Elixir programming language. This gives the benefit of a simple functional syntax with pipelines and whatnot.

It also benefits from the excellent Livebook ecosystem (a Jupyter alternative specific to Elixir).

otsaloma · yesterday at 3:10 PM

Agreed, never had a problem with the speed of anything NumPy or Arrow based.

Here's my alternative:
https://github.com/otsaloma/dataiter
https://dataiter.readthedocs.io/en/latest/_static/comparison...

Planning to switch to NumPy 2.0 strings soon. Other than that I feel all the basic operations are fine and solid.

Note for anyone else rolling up their sleeves: You can get quite far with pure Python when building on top of NumPy (or maybe Arrow). The only thing I found needing more performance was group-by-aggregate, where Numba seems to work OK, although a bit difficult as a dependency.
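For reference, the pure-NumPy group-by-aggregate pattern being described (sort by key, find run boundaries, reduce each run) can be sketched like this; `group_sum` is an illustrative helper, and it assumes non-empty input:

```python
import numpy as np

def group_sum(keys: np.ndarray, values: np.ndarray):
    """Sum `values` per unique key using sort + reduceat (non-empty input assumed)."""
    order = np.argsort(keys, kind="stable")
    k, v = keys[order], values[order]
    # Start index of each contiguous run of equal keys.
    boundaries = np.flatnonzero(np.r_[True, k[1:] != k[:-1]])
    return k[boundaries], np.add.reduceat(v, boundaries)
```

This is the hot loop where a JIT like Numba can help, since the sort and the reduction both touch every element.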

epistasis · yesterday at 6:22 PM

Have you examined siuba at all? It promises to be more similar to the R tidyverse, which IMHO has a much better API. And I personally prefer dplyr/tidyverse to Polars for exploratory analysis.

https://siuba.org

I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...

kussenverboten · yesterday at 7:19 PM

Agree with this. My favorite is the elegance of the data.table API in R. This should be possible in Python too someday.

fluorinerocket · yesterday at 9:11 PM

Thank you. I don't know why people think it's so amazing. I sometimes end up just extracting the NumPy arrays from the DataFrame and doing things the way I know how, because the pandas way is so difficult.

stainablesteel · today at 12:42 AM

I fell on dark days when they changed the MultiIndex reference: level=N, which worked perfectly, was so logical, and could be passed alongside the axis, got swapped out in favor of a separate groupby call.

randomuser45678 · today at 1:24 AM

Check out https://ibis-project.org/

wodenokoto · yesterday at 4:36 PM

In that case I'd recommend dplyr in R. It also integrates with a better plotting library, ggplot2, which gives you not only a better API than matplotlib but also prettier plots (unless you really get to work at your matplotlib code).

te_chris · yesterday at 1:52 PM

Pandas' best feature for me is the df format being readable by DuckDB. The filtering API is a nightmare.

nathan_compton · yesterday at 7:55 PM

Yeah. Pandas is the worst. Polars is better in some ways but so verbose!

adolph · yesterday at 3:23 PM

"So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions"

Yeah, Pandas has that early PHP feel to it, probably out of being a successful first mover.

Kalanos · yesterday at 12:35 PM

The pandas API makes a lot more sense if you are familiar with numpy.

Writing pandas code is a bit redundant. So what?

Who is to say that FireDucks won't make their own API?