Hacker News

oersted, 10/01/2024

Indeed, but Python is used to orchestrate all these lower-level libraries. If you have Python on top, you often want to call these libraries in a loop or, more often, within parallelized multi-stage pipelines.

That is when overhead and parallelization limitations become a serious issue. Frameworks like PySpark take your Python code and are able to distribute it better, but it's still (relatively) very slow and clunky. Or they can limit what you can do to a natively implemented DSL (often SQL, or some DataFrame API, or an API to define DAGs and execute them within a native engine), but you can't do much serious data work without UDFs, which is where Python comes in again. There are tricks, but you can never really avoid the limitations of the Python interpreter.
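
To make the parallelization point concrete, here is a minimal sketch using only the standard library. The `udf` function is a hypothetical stand-in for a per-row Python UDF, and the row count and worker count are arbitrary. On a standard CPython build with the GIL, the threaded version usually does not run faster than the sequential one for pure-Python CPU-bound work, though exact timings depend on your machine, and free-threaded builds may behave differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def udf(x: int) -> int:
    # Hypothetical user-defined function: pure-Python, CPU-bound work per row.
    return (x * x + 1) % 7


def run_sequential(rows):
    # Apply the UDF row by row in a single thread.
    return [udf(x) for x in rows]


def run_threaded(rows, workers=4):
    # Split the rows into one strided chunk per worker thread.
    chunks = [rows[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(run_sequential, chunks))
    # Re-interleave the per-chunk results back into the original row order.
    out = [0] * len(rows)
    for i, part in enumerate(parts):
        out[i::workers] = part
    return out


def timed(fn, *args):
    # Return the function result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    rows = list(range(500_000))
    seq, t_seq = timed(run_sequential, rows)
    thr, t_thr = timed(run_threaded, rows)
    assert seq == thr  # same results either way
    print(f"sequential: {t_seq:.3f}s, 4 threads: {t_thr:.3f}s")
```

Switching to processes sidesteps the GIL, but then every chunk of rows has to be serialized and copied between processes, which is one source of the overhead mentioned above.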