Yes, every time I write df[df.sth == val], a tiny part of me dies.
By comparison, dplyr offers a lot of elegant functionality, and the functional approach in Pandas often feels like an afterthought. If R ends up cleaner than Python, that says a lot (as a side note: it's the same story for ggplot2 versus matplotlib).
Another surprise for friends coming from non-Python backgrounds is the lack of column-level type enforcement. You write df.loc[:, "col1"] and hope it works, with all checks happening at runtime. It would be amazing if Pandas integrated something like Pydantic out of the box.
I still remember when Pandas first came out—it was fantastic to have a tool that replaced hand-rolled data structures using NumPy arrays and column metadata. But that was quite a while ago, and the ecosystem has evolved rapidly since then, including Python’s gradual shift toward type checking.
All I want is for the IDE and Python to correctly infer types and column names for all of these array objects. 99% of the pain for me is working with SQL results and CSVs as opaque pieces of text instead of code.
Nonsense, if you understand why df[df.sth == val] works, you'll see it's great. If you don't, you can also do df.query("sth == @val").
> Yes, every time I write df[df.sth == val], a tiny part of me dies.
That's because it's a bad way to use Pandas, even though it's the most popular and often the recommended way. But the thing is, you can just write "safe", immutable Pandas code with method chaining and lambda expressions, resulting in very Polars-like code. For example:
Plus, nowadays with the latest Pandas versions supporting Arrow data types, Polars' performance improvements over Pandas are considerably less impressive.

Column-level name checking would be awesome, but unfortunately no Python library supports that, and it will likely never be possible unless some big changes are made to the Python type-hint system.