a lot of the training data is either for python 2 or just generally very low quality
That could be it. I still see LLMs fail a set of static typing challenges that I created a couple years ago as a benchmark. Google models still fail it. I wonder if the lack of typing in a lot of the training data makes python harder to reason about?
The quality issue doesn't seem unique to Python.
The versioning issue I've seen across libraries that version change in many languages.
I don't tend to hit Python 2 issues using LLMs with it, but I do hit library things (e.g. Pydantic likes to make changes between libraries - or loads of the libraries used a lot by AI companies).