Hacker News

rich_sasha · yesterday at 11:56 AM · 22 replies

It's a bit sad for me. The biggest issue I have with pandas is the API, not the speed.

So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions, 1d and 2d structures being totally different objects (and no higher-order structures). I'd take 50% of the speed for a better API.

I looked at Polars, which looks neat, but seems made for a different purpose (data pipelines rather than building models semi-interactively).

To be clear, this library might be great; it's just a shame for me that there seems to be no effort to make a pandas-like thing with a better API. Maybe time to roll up my sleeves...


Replies

stared · yesterday at 3:03 PM

Yes, every time I write df[df.sth == val], a tiny part of me dies.
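For concreteness, a minimal sketch of the mask-based filtering being complained about (column and value names invented here), next to pandas' string-based `query` alternative:

```python
import pandas as pd

df = pd.DataFrame({"sth": ["a", "b", "a"], "x": [1, 2, 3]})

# Boolean-mask filtering: note the repetition of `df` inside its own brackets.
masked = df[df["sth"] == "a"]

# The string-based alternative avoids the repetition, at the cost of
# putting the condition in a string with no static checking.
queried = df.query("sth == 'a'")
```

Both produce the same rows; neither form is checked before runtime, which is part of the complaint.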

For comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R is cleaner than Python, that tells you a lot (as a side note: the same story for ggplot2 and matplotlib).

Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
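A hand-rolled column-schema check is one way to approximate the missing enforcement; this sketch is illustrative only (the schema, column names, and `validate` helper are invented, not a pandas or Pydantic API):

```python
import pandas as pd

# Illustrative schema: column name -> expected dtype string.
SCHEMA = {"col1": "int64", "col2": "object"}

def validate(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Raise if a column is missing or has the wrong dtype; checks run at runtime."""
    for col, dtype in schema.items():
        if col not in df.columns:
            raise KeyError(f"missing column: {col}")
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    return df

df = validate(pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]}), SCHEMA)
```

Libraries like pandera exist for exactly this niche, but nothing of the sort ships with pandas itself.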

I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.

movpasd · yesterday at 2:18 PM

I started using Polars for the "rapid iteration" use case you describe, in notebooks and such, and haven't looked back. There are a few ergonomic wrinkles that I mostly attribute to the newness of the library, but I found that Polars forces me to structure my thought process and ask myself "what am I actually trying to do here?".

I find I basically never write myself into a corner with initially expedient but ultimately awkward data structures like I often did with pandas, the expression API makes the semantics a lot clearer, and I don't have to "guess" the API nearly as much.

So even for this use case, I would recommend that anyone reading this try out Polars and see how it feels after the initial learning phase is over.

egecant · yesterday at 2:12 PM

Completely agree. From the perspective of someone who primarily uses R/tidyverse for data wrangling, there is a great article on why the pandas API feels clunky: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-...

ljosifov · yesterday at 12:06 PM

+1, seconding this. My limited experience with pandas had a non-trivial number of "?? Is it really like this? Nah, I'm mistaken for sure; this can't be, no one would do something insane like that" moments. And yet, and yet... FWIW, I've since found that numpy is a must (ofc), but pandas is mostly optional. So I stick to numpy for my own writing, and keep pandas read-only (just executing someone else's).

paddy_m · yesterday at 2:49 PM

Have you tried Polars? It's a much more regular syntax. The regular syntax fits well with the lazy execution, it's very composable for programmatically building queries, and then it's super fast.

martinsmit · yesterday at 12:01 PM

Check out redframes[1] which provides a dplyr-like syntax and is fully interoperable with pandas.

[1]: https://github.com/maxhumber/redframes

amelius · yesterday at 12:35 PM

Yes. Pandas turns 10x developers into .1x developers.

faizshah · yesterday at 1:18 PM

Pandas is a commonly known DSL at this point, so lots of data scientists know pandas like the back of their hand, and that's why a lot of "pandas, but for X" libraries have become popular.

I agree that pandas does not have the best-designed API compared to, say, dplyr, but it also has a lot of functionality, like pivot, melt, and unstack, that other libraries often don't implement. It's also existed for more than a decade at this point, so there's a plethora of resources and Stack Overflow questions.
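For anyone unfamiliar with the reshaping functions mentioned, a small round-trip sketch (data invented): `pivot` goes long-to-wide, `melt` goes back.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "city": ["NY", "LA", "NY", "LA"],
    "temp": [0, 15, 2, 16],
})

# Long to wide: one row per date, one column per city.
wide = df.pivot(index="date", columns="city", values="temp")

# Wide back to long: one (date, city, temp) row per cell.
long = wide.reset_index().melt(id_vars="date", value_name="temp")
```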

On top of that, these days I just use ChatGPT to generate some of my pandas tasks. ChatGPT and other coding assistants know pandas really well so it’s super easy.

But I think if you stick with pandas, after a while you just learn all the weird quirks, and you gain huge benefits from all the things it can do and all the other libraries you can use with it.

sega_sai · yesterday at 12:26 PM

Great point, and one I completely share. I tend to avoid pandas at all costs except for very simple things, as I have been bitten by many issues related to indexing. For anything complicated I tend to switch to DuckDB instead.

omnicognate · yesterday at 12:07 PM

What about the polars API doesn't work well for your use case?

h14h · yesterday at 2:43 PM

If you wanna try a different API, take a look at Elixir Explorer:

https://hexdocs.pm/explorer/exploring_explorer.html

It runs on top of Polars, so you get those speed gains, but uses the Elixir programming language. This gives the benefit of a simple functional syntax with pipelines and whatnot.

It also benefits from the excellent Livebook ecosystem (a Jupyter alternative specific to Elixir).

otsaloma · yesterday at 3:10 PM

Agreed, never had a problem with the speed of anything NumPy or Arrow based.

Here's my alternative:
https://github.com/otsaloma/dataiter
https://dataiter.readthedocs.io/en/latest/_static/comparison...

Planning to switch to NumPy 2.0 strings soon. Other than that I feel all the basic operations are fine and solid.

Note for anyone else rolling up their sleeves: You can get quite far with pure Python when building on top of NumPy (or maybe Arrow). The only thing I found needing more performance was group-by-aggregate, where Numba seems to work OK, although a bit difficult as a dependency.
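For reference, the pure-NumPy group-by-aggregate pattern being described (sort by key, find run boundaries, reduce each run) can be sketched like this; `group_sum` is an illustrative helper, and it assumes non-empty input:

```python
import numpy as np

def group_sum(keys: np.ndarray, values: np.ndarray):
    """Sum `values` per unique key using sort + reduceat (non-empty input assumed)."""
    order = np.argsort(keys, kind="stable")
    k, v = keys[order], values[order]
    # Start index of each contiguous run of equal keys.
    boundaries = np.flatnonzero(np.r_[True, k[1:] != k[:-1]])
    return k[boundaries], np.add.reduceat(v, boundaries)
```

This is the hot loop where a JIT like Numba can help, since the sort and the reduction both touch every element.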

epistasis · yesterday at 6:22 PM

Have you examined siuba at all? It promises to be more similar to the R tidyverse, which IMHO has a much better API. And I personally prefer dplyr/tidyverse to Polars for exploratory analysis.

https://siuba.org

I have not yet used siuba, but would be interested in others' opinions. The activation energy to learn a new set of tools is so large that I rarely have the time to fully examine this space...

kussenverboten · yesterday at 7:19 PM

Agree with this. My favorite is the elegance of the data.table API in R. This should be possible in Python too someday.

fluorinerocket · yesterday at 9:11 PM

Thank you. I don't know why people think it's so amazing. I sometimes end up just extracting the NumPy arrays from the DataFrame and doing things the way I know how, because the pandas way is so difficult.

stainablesteel · today at 12:42 AM

I fell on dark days when they changed the MultiIndex reference: level=N, which worked perfectly, was so logical, and could be passed alongside the axis, got swapped out in favor of a separate groupby call.

randomuser45678 · today at 1:24 AM

Check out https://ibis-project.org/

wodenokoto · yesterday at 4:36 PM

In that case I'd recommend dplyr in R. It also integrates with a better plotting library, ggplot2, which gives you not only a better API than matplotlib but also prettier plots (unless you really get to work at your matplotlib code).

te_chris · yesterday at 1:52 PM

Pandas' best feature for me is the df format being readable by DuckDB. The filtering API is a nightmare.

nathan_compton · yesterday at 7:55 PM

Yeah. Pandas is the worst. Polars is better in some ways but so verbose!

adolph · yesterday at 3:23 PM

"So many foot guns, poorly thought through functions, 10s of keyword arguments instead of good abstractions"

Yeah, Pandas has that early PHP feel to it, probably out of being a successful first mover.

Kalanos · yesterday at 12:35 PM

The pandas API makes a lot more sense if you are familiar with numpy.

Writing pandas code is a bit redundant. So what?

Who is to say that FireDucks won't make their own API?