> Yes, every time I write df[df.sth == val], a tiny part of me dies.
That's because it's a bad way to use Pandas, even though it is the most popular and often-recommended way. But the thing is, you can just write "safe", immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:
    import pandas as pd

    df = (
        pd.read_csv("./file.csv")
        .rename(columns={"value": "x"})
        .assign(y=lambda d: d["x"] * 2)
        .loc[lambda d: d["y"] > 0.5]
    )
Plus, nowadays, with the latest Pandas versions supporting Arrow datatypes, Polars' performance improvements over Pandas are considerably less impressive.

Column-level name checking would be awesome, but unfortunately no Python library supports that, and it will likely never be possible unless some big changes are made to the Python type-hint system.
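For anyone who hasn't tried it, opting into the Arrow backend is just a flag at read time; a minimal sketch, assuming pandas >= 2.0 with pyarrow installed (the file name is the hypothetical one from the example above):

    import pandas as pd

    # dtype_backend="pyarrow" asks pandas to back columns with Arrow arrays
    # instead of NumPy ones (available since pandas 2.0)
    df = pd.read_csv("./file.csv", dtype_backend="pyarrow")
    print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow]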
Using `lambda` without care is dangerous, because it risks not being vectorized at all and being super slow, operating one row at a time. Is `d` a single row, the entire Series, or the entire DataFrame?
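For the record, in `assign` and `.loc` the callable receives the entire DataFrame, so the chained example above stays vectorized; the slow case is row-wise `apply`. A minimal sketch of the contrast, on a made-up toy frame:

    import pandas as pd

    df = pd.DataFrame({"x": [0.1, 0.4, 0.9]})

    # Vectorized: assign() calls the lambda once with the entire DataFrame,
    # so d["x"] * 2 is a single column-wise operation.
    fast = df.assign(y=lambda d: d["x"] * 2)

    # Row-at-a-time: apply(..., axis=1) calls the lambda once per row,
    # which is the slow, non-vectorized pattern to watch out for.
    slow = df.assign(y=df.apply(lambda row: row["x"] * 2, axis=1))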
Agreed 100%. I use this method-chaining style all the time and it works like a charm.
I mean, yes, there are Arrow data types, but they've got a long way to go before they reach full parity with the NumPy version.
I’m not really sure why you think

    .loc[lambda d: d["y"] > 0.5]

is stylistically superior to

    df[df["y"] > 0.5]

I agree it comes in handy quite often, but that still doesn’t make it great to write compared to what SQL or dplyr offers in terms of choosing columns to filter on (`where y > 0.5` for SQL and `filter(y > 0.5)` for dplyr).
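For comparison, Polars expressions get close to that dplyr ergonomics while staying in Python; a rough sketch of the same pipeline in Polars (same hypothetical file and column names as above):

    import polars as pl

    # filter() takes an expression naming the column directly,
    # much like dplyr's filter(y > 0.5)
    df = (
        pl.read_csv("./file.csv")
        .rename({"value": "x"})
        .with_columns(y=pl.col("x") * 2)
        .filter(pl.col("y") > 0.5)
    )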