They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.
As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)
This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
Needing the right scaffolding is the problem.
Today I asked 3 versions of Gemini “what were sales in December” with access to a SQL model of sales data.
All three ran `WHERE EXTRACT(MONTH FROM date) = 12` with no year filter (except that 2.5 Flash sometimes gave me sales for Dec 2023).
No sane human would hear “sales in December” and sum up every December across every year. But the models produced numbers that an uncritical eye wouldn’t catch as wrong.
That’s the type of logical error these models produce, and it’s what’s bothering the author. They can be very poor at analysis in real-world situations because they do these things.
"They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed."
Isn't this the same thing? I mean, this has to work for regular people, right?
I'm referring to these kind of articles as "Look Ma, I made the AI fail!"
I’ve seen some correlation between people who write clean and structured code, follow best practices and communicate well through naming and sparse comments, and how much they get out of LLM coding agents. Eloquence and depth of technical vocabulary seem to be a factor too.
Make of that what you will…
Having to prime it with more context and more guardrails seems to imply they're getting worse. That's less context and fewer guardrails the model can infer or intuit on its own.
So basically “you’re holding it wrong?”