Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.
The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do? And now a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!
No. Stop. Data frames are not fundamentally different from database tables [1]. There's no reason to invent a completely new API for them. You'll get within 10% of optimal just by porting SQL to your language. Which dplyr does, and then closes most of the remaining optimality gap by going beyond SQL's limitations.
You found a small core of operations that generates everything? Great. Also, did you know Brainfuck is Turing-complete? Nobody cares. Not all "complete" systems are created equal. A great DSL is not just about getting down to a small number of operations. It's about getting down to meaningful operations that are grammatically composable. The relational algebra that inspired SQL already nailed this. Build on SQL. Don't make up your own thing.
Like, what is "drop duplicates"? What are duplicates? Why would anyone need to drop them? That's a pandas-brained operation. You want the distinct keys defined by a select set of key columns, like SQL and dplyr provide.
Who needs a separate select and rename? Select is already using names, so why not do your name management there? One flexible select function can do it all. Again, like both SQL and dplyr.
Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.
Props to pandas for helping so many people who have no choice but to do tabular data analysis in Python, but the pandas API is not the right foundation for anything, not even a better version of pandas.
[1] No, row labels and transposition are not a good enough reason to regard them as different. They are both just structures that support pivoting, which is vastly more useful, and again, implemented by both R and many popular dialects of SQL.
I guess I have pandas brain because I definitely want to drop duplicates, 100% of the time I'm worried about duplicates and 99% of the time the only thing I want to do with duplicates is drop them. When you've got 19 columns it's _really fucking annoying_ if the tool you're using doesn't have an obvious way to say `select distinct on () from my_shit`. Close second at say, 98% of the time, I want to a get a count of duplicates as a sanity check because I know to expect a certain amount of them. Pandas makes that easy too in a way SQL makes really fucking annoying. There are a lot of parts on pandas that made me stop using it long ago but first class duplicates handling is not among them.
And the API is vastly superior to SQL is some respects from a user perspective despite being all over the place in others. Dataframe select/filtering e.g. df = df[df.duplicated(keep='last')] is simple, expressive, obvious, and doesn't result in bleeding fingers. The main problem is the rest of the language around it with all the indentations, newlines, loops, functions and so on can be too terse or too dense and much hard to read than SQL.
On reflection I think it's possible I may have missed the potential positive value of the post a bit. Maybe analyzing pandas gets you down to a set of data frame primitives that is helpful to build any API. Maybe the API you start with doesn't matter. I don't know. When somebody works hard to make something original, you should try to see the value in it, even if the approach is not one you would expect to be helpful.
I stand by my warnings against using pandas as a foundation for thinking about tabular data manipulation APIs, but maybe the work has value regardless.
Amen.
The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.
> But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?
SQL only works on well defined data sets that obey relational calculus rules. Pandas is a power tool for dealing with data as you find it. Without Pandas you are stuck with tools like Excel.
> just by porting SQL to your language
You make it sound like writing an SQL parser and query engine is a trivial task. Have you ever looked at the implementation of a query engine to see what’s actually involved? You can’t just ‘build on SQL’, you have to build a substantial library of functions to build SQL on top of.