Hacker News

gertlabs today at 1:58 AM

Surprisingly, LLMs are actually much worse at reasoning in Python than other common programming languages for agentic coding tasks.

Data here: https://gertlabs.com/rankings?mode=agentic_coding


Replies

BariumBlue today at 2:23 AM

Hah, I was just thinking that Python likely has a vast ocean of training data, but it's likely of lower quality, since much of it is written by beginners and those who aren't primarily programmers.

stingraycharles today at 5:42 AM

I’m super surprised that C++ scores so high; this does not match our experience at all, and for anything performance-critical it always drops the ball completely.

I also don’t understand how these “games” map to real-world complex problems. How are you measuring success? How does “adversarial customer service” map to “this LLM is better at C++ than the other”? How are you sure you’re not just benchmarking language suitability for a problem?

I have so many questions about this…

isityettime today at 2:50 AM

I would love to see how they do with functional languages and especially Lisps here. I've noticed pretty good performance with Emacs Lisp relative to overall model strength, but I haven't used LLMs to write application code in any such languages.

It would also be interesting to see how Python compares to other languages in its niche (Ruby, Perl, Raku).

Thanks for putting this together! It's interesting.

fulafel today at 5:03 AM

What would comparing rates across languages tell us in the context of this benchmark? Are the tasks the same, or robustly difficulty-normalized, across the languages?

Also somehow the 2 language comparison graphs (avg percentile and success rate) rank Python in dramatically different positions, with Python outranking Rust and Java in the success rate. What does the avg percentile mean in this context?

robot-wrangler today at 5:09 AM

> Data here: https://gertlabs.com/rankings?mode=agentic_coding

Oh wow, we got "tribal domination", "market simulator" and "adversarial customer service". I don't know what those are but it sure sounds like big torment nexus milestones

Maybe we could at least play nicer games like hackenbush and act surprised when there's some wicked use-case that's isomorphic.

EDIT: Ok fine. I like "Rubik's Cube Chess" a lot. Never heard of it, is this analyzed formally at all? Hard to search for since there's tons of collisions

riedel today at 6:43 AM

My feeling is that for agentic tasks it's not only language design but also LSPs, error messages, and static analysis capabilities that dominate the benchmarks. It would IMHO be interesting to look into better subsets of Python and style/rewrite techniques, as well as alternative linters and their effects on performance.

js8 today at 4:22 AM

The LLMs are generally still pretty bad at (deductive) reasoning. IME they go along more with things like variable names and comments than the actual program logic (it would be an interesting experiment to compare an LLM's understanding of three identical programs with different identifiers: one with normal identifiers, one with obfuscated identifiers, and one with deliberately misleading identifiers). I also think this particular comparison comes down to typing, which helps keep the LLM's reasoning from going astray.
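The experiment described above could be sketched in a few lines of Python. All names here are hypothetical, chosen purely for illustration: three semantically identical functions whose identifiers range from honest to actively misleading.

```python
# Version 1: normal, descriptive identifiers.
def average(values):
    return sum(values) / len(values)

# Version 2: obfuscated identifiers -- same logic, no semantic hints.
def f0(a0):
    return sum(a0) / len(a0)

# Version 3: deliberately misleading identifiers -- still the same logic,
# but the names falsely suggest a maximum over a sorted list.
def find_maximum(sorted_list):
    return sum(sorted_list) / len(sorted_list)

# All three compute the arithmetic mean; only the identifiers differ.
print(average([2, 4, 6]), f0([2, 4, 6]), find_maximum([2, 4, 6]))
```

If an LLM rates these three programs differently when asked what they do, that would support the claim that it leans on names more than on logic.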

When we reason, we typically need to propagate constraints to arrive at a solution that satisfies them. I think the best language to reason in could be something like Lean, which allows both constraints and actual code to be expressed at the same time. Although this might not be the case for current LLMs, as I explain above.

bushbaba today at 2:42 AM

Cool to see my hunch backed by data. Python is a scripting language with OOP bolted on. That means there isn't really the styling consistency other languages have; things tend to look like PHP: a collection of various scripts that invoke one another.

w0m today at 3:31 AM

Huh. This surprises me. Digging in, it looks like it comes down to interpreted + dynamically typed vs. compiled + statically typed.

TIL. If I were to start a truly vibe-coded project, Go would have a significant leg up.
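A minimal sketch of the dynamic-vs-static distinction mentioned above, using hypothetical `Item`/`total_price` names: the misspelled attribute below is only detected when the line actually executes, whereas a compiled, statically typed language (or a checker like mypy) would reject it before the program ever runs.

```python
from dataclasses import dataclass

@dataclass
class Item:
    price: float

def total_price(items):
    # Typo: 'prcie' instead of 'price'. Python happily accepts this
    # function definition; the mistake only surfaces at runtime, when
    # the attribute lookup is actually performed.
    return sum(item.prcie for item in items)

try:
    total_price([Item(3.0), Item(4.0)])
except AttributeError as exc:
    print("caught only at runtime:", exc)
```

An agent iterating on Python code therefore gets this class of feedback one failing run at a time, while a compiler surfaces it all at once.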

hooloovoo_zoo today at 6:47 AM

Mm, the code is constrained to run inside a game 'tick'?

andai today at 8:10 AM

I thought it might have to do with the type system, but JavaScript's type system is atrocious and it scores about 50% higher, so my theory doesn't make much sense.

rossjudson today at 2:26 AM

My standard joke here:

Q: Say, what does this Python code do?

A: Nobody f&%^ing knows.

altmanaltman today at 3:12 AM

Hey, they said it had a lot of training data, not necessarily high-quality Python training data.

ricardo_lien today at 3:11 AM

This surprised me, but I can understand it - Python sucks in many ways lol.
