Hacker News

pjmlp 10/01/2024

Most Python libraries are bindings to native libraries anyway.

Any other ecosystem can plug into the same underlying native libraries, or even call them directly if it is written in the same language.
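To illustrate what "bindings to native libraries" means in practice, here is a minimal sketch using the standard `ctypes` module to call `strlen` from the C library. It assumes a POSIX system; the wrapper name `native_strlen` is made up for the example.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. If find_library returns None, CDLL(None)
# exposes the symbols already loaded into the process, libc included.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def native_strlen(data: bytes) -> int:
    """Thin Python wrapper (hypothetical name); the work happens in libc."""
    return libc.strlen(data)


if __name__ == "__main__":
    print(native_strlen(b"hacker news"))
```

Any language with a C FFI can load the same shared library in much the same way, which is the point being made above.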

In a way, the performance pressure building up in the Python world is interesting: without it, the CPython folks would never have reconsidered their stance on performance.


Replies

OptionOfT 10/01/2024

The output of most of these native libraries isn't 1-1 mappable to Python. Depending on the data, you need to write native data wrappers or, worse, marshal the data into managed memory. The overhead can be high.
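A rough sketch of the two options, using only the standard library. The `array` here stands in for a buffer handed back by a native library; that stand-in, and the variable names, are assumptions for illustration.

```python
import sys
from array import array

# 100,000 C doubles stored contiguously, roughly what a native library returns.
native = array("d", range(100_000))

# Option 2 first: marshal into managed memory. Every element becomes
# a separate Python float object, and the list is an independent copy.
managed = native.tolist()

# Option 1: wrap the native buffer. A memoryview shares the memory,
# so nothing is copied and writes go through to the array.
view = memoryview(native)
view[0] = 42.0

native_bytes = native.itemsize * len(native)
managed_bytes = sys.getsizeof(managed) + sum(sys.getsizeof(x) for x in managed)

if __name__ == "__main__":
    print("native buffer bytes:", native_bytes)
    print("marshalled list bytes:", managed_bytes)
```

The wrapper keeps the data where it is but has to be written per data type; the marshalled copy is plain Python but costs memory and a full pass over the data.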

It gets worse because Python doesn't expose memory management to you. This is an advantage at first, but later on it causes bloat.

Python is an incredibly easy interface over these native libraries, but has a lot of runtime costs.

oersted 10/01/2024

Indeed, but Python is used to orchestrate all these lower-level libraries. If you have Python on top, you often want to call these libraries in a loop, or more often, within parallelized multi-stage pipelines.
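As a small sketch of the loop overhead, the snippet below computes the same CRC with zlib in one native call and in many small calls driven from a Python loop. The function names and the chunk size are arbitrary choices for the example, and the timings will vary by machine.

```python
import timeit
import zlib

payload = bytes(range(256)) * 400  # about 100 KB of sample data


def crc_one_call(data: bytes) -> int:
    # A single crossing into native code; zlib loops over the buffer in C.
    return zlib.crc32(data)


def crc_many_calls(data: bytes, chunk: int = 16) -> int:
    # Same result, but the loop runs in the interpreter and crosses into
    # native code once per chunk, carrying the running CRC along.
    crc = 0
    for start in range(0, len(data), chunk):
        crc = zlib.crc32(data[start:start + chunk], crc)
    return crc


if __name__ == "__main__":
    for fn in (crc_one_call, crc_many_calls):
        seconds = timeit.timeit(lambda: fn(payload), number=20)
        print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```

Both functions do identical native work; any difference in the timings comes from the interpreter-level loop, slicing, and per-call overhead.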

Overhead and parallelization limitations become a serious issue then. Frameworks like PySpark take your Python code and are able to distribute it better, but it's still (relatively) very slow and clunky. Or they can limit what you can do to a natively implemented DSL (often SQL, or some DataFrame API, or an API to define DAGs and execute them within a native engine), but you can't do much serious data work without UDFs, where Python comes in again. There are tricks, but you can never really avoid the limitations of the Python interpreter.
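The DSL versus UDF trade-off can be seen on a small scale with the standard `sqlite3` module, used here as a stand-in for a native engine (not PySpark). The table, the `py_abs` function, and the call counter are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?)",
    [(float(i),) for i in range(-500, 500)],
)

calls = {"n": 0}  # counts how often the engine calls back into Python


def py_abs(x):
    """Hypothetical UDF: runs in the interpreter once per row."""
    calls["n"] += 1
    return abs(x)


# Register the Python function so SQL can call it by name.
conn.create_function("py_abs", 1, py_abs)

# Built-in ABS: the whole query stays inside the native engine.
native_total = conn.execute("SELECT SUM(ABS(value)) FROM readings").fetchone()[0]

# UDF: same query shape, but every row round-trips through Python.
udf_total = conn.execute("SELECT SUM(py_abs(value)) FROM readings").fetchone()[0]

if __name__ == "__main__":
    print(native_total, udf_total, calls["n"])
```

The two queries return the same total, but the second one invokes the interpreter for each of the 1,000 rows, which is where the per-row cost of a UDF comes from.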