Hacker News

jtrn · yesterday at 3:52 PM

I'm a bit out of the loop with this, but I hope it's not like that time with Python 3.14, when a geometric mean speedup of about 9-15% over the standard interpreter was claimed when built with Clang 19. It turned out the results were inflated due to a bug in LLVM 19 that prevented proper "tail duplication" optimization in the baseline interpreter's dispatch loop. The actual gains were approximately 4%.

Edit: I read through it and have come to the conclusion that the post is 100% OK and properly framed: he explicitly says his approach is "sharing early and making a fool of myself," prioritizing transparency and rapid iteration over ironclad verification upfront.

One could argue that he should have run cross-compiler checks, sought independent audits, or delayed the announcement until the results were bulletproof across all platforms. But given that he is 100% transparent about his thinking and how he works, it's all good in the hood.


Replies

kenjin4096 · yesterday at 4:28 PM

Thanks :), that was indeed my intention. I think the previous 3.14 mistake was actually a good one in hindsight, because if I hadn't publicized our work early, I wouldn't have caught Nelson's attention, and Nelson probably wouldn't have spent a month digging into the Clang 19 bug. That would also have meant the bug wouldn't have been caught in the betas and might have shipped with the actual release, which would have been far worse. So in hindsight this was all a happy accident that I'm grateful for, since overall CPython still benefited!

Also, this time I'm pretty confident because there are two perf improvements here: the dispatch logic and the inlining. MSVC can actually convert switch-case interpreters to threaded code automatically if some conditions are met [1]. However, it does not seem to do that for the current CPython interpreter; I suspect the CPython interpreter loop is just too complicated to meet those conditions. The key point, though, is that we would be relying on MSVC again to do its magic, whereas this tail-calling approach gives more control to the writers of the C code. The inlining is pretty much impossible to convince MSVC to do except with `__forceinline` or by changing things to use macros [2]. However, we don't just mark every function as forceinline in CPython, as it might negatively affect other compilers.

[1]: https://github.com/faster-cpython/ideas/issues/183
[2]: https://github.com/python/cpython/issues/121263
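
To make the dispatch difference concrete, here is a toy sketch of the two interpreter styles being discussed. This is not CPython's real code; the opcode set, handler table, and function names are all made up for illustration. The first loop uses classic switch-case dispatch, where producing threaded code (a separate indirect jump at the end of each handler) is left to the compiler's heuristics; the second writes that shape into the source by ending every opcode handler with a call through a handler table.

```c
/* Toy bytecode interpreter, illustration only -- not CPython's code.
 * The opcodes, handler table, and VM state here are hypothetical. */
#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* Style 1: classic switch-case dispatch. Whether the compiler turns this
 * into threaded code (one indirect jump per case) is up to its heuristics. */
static int run_switch(const uint8_t *ip, int acc) {
    for (;;) {
        switch (*ip++) {
            case OP_INC: acc++; break;
            case OP_DEC: acc--; break;
            case OP_HALT: return acc;
        }
    }
}

/* Style 2: one small function per opcode, each ending by dispatching to the
 * next handler through a table. The "one jump per opcode" shape is written
 * down explicitly instead of being hoped for. */
typedef int (*handler_t)(const uint8_t *ip, int acc);

static int op_inc(const uint8_t *ip, int acc);
static int op_dec(const uint8_t *ip, int acc);
static int op_halt(const uint8_t *ip, int acc);

static const handler_t table[] = { op_inc, op_dec, op_halt };

#define DISPATCH() return table[*ip](ip + 1, acc)

static int op_inc(const uint8_t *ip, int acc)  { acc++; DISPATCH(); }
static int op_dec(const uint8_t *ip, int acc)  { acc--; DISPATCH(); }
static int op_halt(const uint8_t *ip, int acc) { (void)ip; return acc; }

int main(void) {
    const uint8_t prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d %d\n",
           run_switch(prog, 0),
           table[prog[0]](prog + 1, 0));  /* both print 1 */
    return 0;
}
```

In a real interpreter the second style only pays off if the compiler is required to compile each `DISPATCH()` as a jump rather than a call (otherwise the stack would grow with every opcode); that guarantee is exactly what the tail-calling approach is about, as the reply below discusses.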

haberman · yesterday at 7:03 PM

I’ll repeat what I said at that time: one of the benefits of the new design is that it’s less vulnerable to the whims of the optimizer: https://news.ycombinator.com/item?id=43322451

If getting the optimal code relies on a pile of heuristics going in your favor, you're more vulnerable to the possibility that someday the heuristics will go the other way. Tail duplication is what we want in this case, but it's possible that a future version of the compiler could decide it's not desirable because of the increased code size.

With the new design, the Python interpreter can express the desired shape of the machine code more directly, leaving it less vulnerable to the whims of the optimizer.
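
As a concrete illustration of "expressing the desired shape directly", here is a minimal sketch (not CPython's actual sources; the handler name and table are invented) using a guaranteed-tail-call attribute such as Clang's musttail. The handler must end in a jump, so the threaded-code shape no longer hinges on heuristics like tail duplication.

```c
#include <stdint.h>

typedef int (*handler_t)(const uint8_t *ip, int acc);

/* Hypothetical per-opcode handler table, defined elsewhere. */
extern const handler_t handlers[];

int op_add_one(const uint8_t *ip, int acc) {
    acc += 1;
    /* The attribute makes the tail call mandatory: the compiler must emit a
     * jump here (and error out if it can't), rather than maybe emitting a
     * call depending on its optimization heuristics. */
    __attribute__((musttail)) return handlers[*ip](ip + 1, acc);
}
```

If the compiler can't honor the attribute it fails loudly at build time, which is the opposite failure mode of a heuristic quietly changing between compiler versions.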
