Hacker News

AI coding assistants are getting worse?

434 points by voxadam last Thursday at 3:20 PM | 700 comments | view on HN

Comments

llmslave2 last Thursday at 10:01 PM

One thing I find really funny is that when AI enthusiasts make claims about agents and their own productivity, it's always entirely anecdotal, based on their own subjective experience; but when others make claims to the contrary, suddenly there is some overwhelming burden of proof that has to be met before any sort of claim about the capabilities of AI workflows can be made. So which is it?

show 34 replies
renegade-otter last Thursday at 3:54 PM

They are not worse - the results are not repeatable, which is a much worse problem.

Like with cab hailing, shopping, social media ads, food delivery, etc.: a whole ecosystem of workflows and companies will be built around this. Then the prices will start going up, with nowhere to run. The pricing models are simply not sustainable. I hope everyone realizes that the current LLMs are subsidized, like your Seamless and Uber were in the early days.

show 17 replies
bee_rider last Thursday at 4:06 PM

This seems like a kind of odd test.

> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.

    import pandas as pd

    df = pd.read_csv('data.csv')
    df['new_column'] = df['index_value'] + 1
    # there is no column 'index_value' in data.csv
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.

> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.

So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.

Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?

It seems like he just prefers how GPT-4 and 4.1 failed to follow his prompt, over 5. They are all hamstrung by the fact that the task is impossible, and they aren’t allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing ‘index_value’ column was supposed to hold the value of the index).
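
For what it's worth, the "code that would help me debug the problem" answer the author says he was hoping for might look something like this (my sketch, not from the article): it surfaces the real problem (the missing column) instead of papering over it.

    import pandas as pd

    df = pd.read_csv('data.csv')

    # Surface the actual problem: the expected column is absent from the data.
    if 'index_value' not in df.columns:
        raise KeyError(
            f"'index_value' not found in data.csv; available columns: {list(df.columns)}"
        )

    df['new_column'] = df['index_value'] + 1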

show 4 replies
anttiharju yesterday at 5:21 AM

I like AI for software development.

Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches so I'd have more time to study. Turns out, lunches were not just refueling sessions but ways to relax. So I lost that relaxation time and it ended up being ±0 long-term.

AI is for sure a net positive in terms of getting more done, but it's way too easy to gloss over details and end up backtracking more.

"Reality has a surprising amount of detail" or something along those lines.

show 4 replies
jackfranklyn yesterday at 4:49 PM

The measurement problem here is real. "10x faster" compared to what exactly? Your best day or your average? First-time implementation or refactoring familiar code?

I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.

The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.

ronbenton last Thursday at 3:53 PM

I am used to seeing technical papers from IEEE, but this is an opinion piece? I mean, there is some anecdata and one test case presented to a few different models, but nothing more.

I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way.

show 5 replies
CashWasabi last Thursday at 3:58 PM

I always wonder what happens when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and the forums are gone, and there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?

show 6 replies
theptip last Thursday at 4:18 PM

They are not getting worse, they are getting better. You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start for the failure mode discussed.

As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)

This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
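
A minimal sketch of the kind of guard-rail unit test being suggested here (the module and function names are hypothetical): it fails loudly if an agent "fixes" the script by fabricating the missing column instead of reporting it.

    import pandas as pd
    import pytest

    from transform import add_new_column  # hypothetical module under test


    def test_missing_column_is_reported_not_faked(tmp_path):
        # A CSV that deliberately lacks the 'index_value' column.
        csv_path = tmp_path / "data.csv"
        csv_path.write_text("a,b\n1,2\n")

        # The fix must not invent the column; it should raise a clear error.
        with pytest.raises(KeyError):
            add_new_column(pd.read_csv(csv_path))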

show 7 replies
Kuinox last Thursday at 3:50 PM

I speculate that LLM providers are dynamically serving smaller models to handle usage spikes and the need for compute to train new models. I have observed that models and agents become worse over time, especially before a new model is released.

show 2 replies
bodge5000 yesterday at 4:50 PM

A little off topic, but this seems like one of the better places to ask without getting a bunch of zealotry: a question for those of you who like using AI for software development, particularly with Claude Code or OpenCode.

I'll admit I'm a bit of an AI sceptic, but I want to give it another shot over the weekend. What do people recommend these days?

I'm happy to spend money, but obviously don't want to spend a tonne since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that is nearly $20 a prompt. Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by it I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option; I don't know.

show 4 replies
nyrikki last Thursday at 11:10 PM

> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.

> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

It is not just `inexperienced coders` that make this signal pretty much useless. I mostly use coding assistants for boilerplate: I will accept the suggestion, then delete much of what it produced, especially in the critical path.

For many users, this is much faster than trying to get another approximation:

     :,/^}/-d
That ex command deletes from the current line up to just before the next line starting with `}`; same for `10dd`, etc. It is all muscle memory. Then again, I use a local fill-in-the-middle tiny LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.

It would be a mistake to think that filtering out junior devs will result in good data; the concept is flawed in general. Accepting output may have nothing to do with the correctness of the provided content, IMHO.

lucideng yesterday at 5:08 PM

This quote feels more relevant than ever:

> Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.

Or in the context of AI:

> Give a man code, and you help him for a day. Teach a man to code, and you help him for a lifetime.

show 1 reply
sosodev last Thursday at 4:07 PM

He asked the models to fix the problem without commentary and then… praised the models that returned commentary. GPT-5 did exactly what he asked. It doesn’t matter if it’s right or not. It’s the essence of garbage in and garbage out.

show 1 reply
jackfranklyn yesterday at 9:21 AM

The quality variation from month to month has been my experience too. I've noticed the models seem to "forget" conventions they used to follow reliably - like proper error handling patterns or consistent variable naming.

What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.

show 2 replies
dathinab yesterday at 2:51 PM

In general "failing to run (successfully)" should per-see been seen as a bad signal.

It might still be:

- the closest to a correct solution the model can produce

- helpful for finding out what is wrong

- intended (e.g. in a typical very short red->green unit-test dev approach, you want to generate some code which doesn't run correctly _just yet_; tests for newly found bugs are supposed to fail until the bug is fixed; see the sketch after this list). Etc.

- if "making run" means removing sanity checks, doing something semantically completely different or similar it's like the OP author said on of the worst outcomes

kristopolous yesterday at 12:05 AM

I stopped using them. Occasionally I go back to see if they've gotten better, but really I just treat them as a more interactive Stack Overflow/Google.

I've been stung by them too many times.

The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.

amarka yesterday at 5:51 PM

While the author's experience (as a banker and data scientist) is clearly valuable, it is unclear whether it alone is sufficient to support the broader claims made. Engineering conclusions typically benefit from data beyond individual observation.

StarlaAtNight last Thursday at 3:36 PM

We should be able to pin to a version of training-data history the way we can pin to software package versions. Release new updates with SemVer and let people decide if it's worth upgrading.

I'm sure it will get there as this space matures, but right now model updates feel very force-fed to users.
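
Something adjacent to this already exists at the model level: most APIs let you pin a dated snapshot instead of a floating alias, though that pins the weights, not the training data. A sketch using the OpenAI Python SDK (the snapshot name is illustrative):

    from openai import OpenAI

    client = OpenAI()

    # Pin a dated snapshot rather than a floating alias like "gpt-4o",
    # so an upstream model update can't silently change behaviour.
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",  # illustrative snapshot name
        messages=[{"role": "user", "content": "Refactor this function ..."}],
    )
    print(response.choices[0].message.content)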

show 5 replies
crazygringo last Thursday at 3:53 PM

This is a sweeping generalization based on a single "test" of three lines that is in no way representative.

show 2 replies
Hobadee yesterday at 2:30 PM

> If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right.

So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D

winddude yesterday at 3:57 PM

Not sure I agree with his tests, but I agree with the headline. I recently had Cursor launch into seemingly endless loops of grepping, `cd`, and `ls` on files, in multiple new convos. I think they're trying to do too much for too many "vibe coders", and the lighter-weight versions that did less were easier to steer to meet your architecture and needs.

dudeinhawaii yesterday at 7:28 PM

The most annoying thing in the LLM space is that people write articles and research with grand pronouncements based upon old models. This article has no mention of Sonnet 4.5, nor does it use any of the actual OpenAI coding models (GPT-5-Codex, GPT-5.1 Codex, etc), and based upon that, even the Opus data is likely an older version.

This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".

You might as well ignore all of the articles and pronouncements and stick to your own lived experience.

The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.

The newer models DO let you know when something is impossible or unlikely to solve your problem.

Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.

I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".

amelius last Thursday at 3:42 PM

A dataset with only data from before 2024 will soon be worth billions.

show 3 replies
maxbaines last Thursday at 3:52 PM

Not seeing this in my day to day, in fact the opposite.

show 2 replies
reassess_blind yesterday at 9:53 AM

I only have experience with using it within my small scope, being full-stack NodeJS web development (i.e. an area with many solved problems and millions of lines of existing code for the models to reference), but my experience with the new Opus model in Claude Code has been phenomenal.

show 1 reply
kristianp last Thursday at 11:00 PM

The failure mode of returning code that only appears to work correctly is one I've encountered before. I've had Sonnet (4 I think) generate a bunch of functions that check if parameter values are out of valid range and just return without error when they should be a failing assertion. That kind of thing does smell of training data that hasn't been checked for correctness by experienced coders.

Edit: Changed 3.5 to 4.

Edit: Looking back at edits and check-ins by AI agents, it strikes me that the check-ins should contain the prompt used and the model version. More recent Aider versions do add the model.
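
The range-check pattern described above, next to what was presumably wanted (a hypothetical illustration):

    def _apply_volume(level: int) -> None:
        # Stand-in for the real side effect.
        print(f"volume set to {level}")


    # What the model generated: out-of-range input is silently ignored.
    def set_volume(level: int) -> None:
        if level < 0 or level > 100:
            return  # silent no-op hides the caller's bug
        _apply_volume(level)


    # What was wanted: fail loudly so the bad value is noticed immediately.
    def set_volume_strict(level: int) -> None:
        assert 0 <= level <= 100, f"volume out of range: {level}"
        _apply_volume(level)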

anttiharju yesterday at 6:12 PM

I've felt this. Bit scary given how essential of a tool it has become.

I started programming before modern LLMs, so I can still hack it without them; it will just take a lot longer.

furyofantares last Thursday at 7:06 PM

He graded GPT-4 as winning because it didn't follow his instructions. And the instructions are unrealistic for anyone using coding assistants.

Maybe it's true that for some very bad prompts the old versions did a better job by not following the prompt, and that this reduces utility for some people.

Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.

Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.

For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.

minimaxir last Thursday at 3:54 PM

The article uses pandas as a demo example for LLM failures, but, for some reason, even the latest LLMs are bad at data science code, which is extremely counterintuitive. Opus 4.5 can write an EDA backbone, but it's often too verbose for code that's intended for a Jupyter Notebook.

The issues I've seen have been less egregious than hallucinating an "index_value" column, though, so I'm suspicious. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured/JSON.

show 1 reply
shevy-java yesterday at 3:42 AM

I find the whole idea of AI coding assistants strange.

For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.

show 2 replies
chankstein38 yesterday at 3:27 PM

The issue is NOT particular to the GPT models. Gemini does this stuff to me all the time as well! Band-aids around actual problems, hidden debugging, etc. They're just becoming less usable.

cons0le last Thursday at 3:58 PM

And the ads aren't even baked in yet... that's the end goal of every company.

show 1 reply
troyvit last Thursday at 3:58 PM

There's really not much to take from this post without a repo and a lot of supporting data.

I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?

stared last Thursday at 3:41 PM

Is it possible to re-run it? I am curious about Gemini 3 Pro.

As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.

erelong yesterday at 3:19 AM

Interesting if true, but I would presume it to be negligible compared to the magnitude of gains over "manual coding", right? So nothing to lose sleep over at the moment...

pablonm yesterday at 1:16 PM

I noticed Claude Code (on a $100 Max subscription) has become slower for me in the last few weeks. Just yesterday it spent hours coding a simple feature which I could have coded myself faster.

Johnny555 yesterday at 2:09 AM

> But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.

I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.

It feels like lately Google's AI search summaries are getting worse - they have a kernel of truth, but combine it with an incorrect answer.

bob1029 last Thursday at 4:03 PM

> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.

I think if you keep the human in the loop this would go much better.

I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
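
As a rough guess at the shape of such a tool (all names and the schema layout here are illustrative, in the Anthropic-style JSON tool format):

    # Hypothetical "AskHuman" tool definition: the model must supply both halves
    # of the (question, how it unblocks progress) tuple before the call is valid.
    ASK_HUMAN_TOOL = {
        "name": "ask_human",
        "description": "Ask the human a blocking question instead of guessing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "how_this_unblocks_progress": {"type": "string"},
            },
            "required": ["question", "how_this_unblocks_progress"],
        },
    }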

falldrown yesterday at 12:37 AM

Codex is still useful for me. But I don't want to pay $200/month for it.

> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.

AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.

show 1 reply
isodev yesterday at 1:57 AM

> It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

So much this... the number of times Claude sneaks in default values, or avoids unwrapping optional values, just to avoid a crash at all costs... it's nauseating.

mat_b yesterday at 3:42 AM

I have been noticing this myself for the last couple of months. I cannot get the agent to stop masking failures (e.g. swallowing exceptions) and fail loudly instead.
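
The masking pattern in question, next to the fail-loudly version (a hypothetical example, not actual agent output):

    import json


    def load_config_masked(path: str) -> dict:
        # What the agent keeps producing: the exception is swallowed and a
        # fake default is returned, so the failure never surfaces.
        try:
            with open(path) as f:
                return json.load(f)
        except Exception:
            return {}  # silently hides a missing or corrupt config file


    def load_config(path: str) -> dict:
        # What was asked for: fail loudly and let the caller deal with it.
        with open(path) as f:
            return json.load(f)  # FileNotFoundError / JSONDecodeError propagate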

That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.

show 1 reply
metobehonest last Thursday at 4:28 PM

I can imagine Claude getting worse. I consider myself bearish on AI in general and have long been a hater of "agentic" coding, but I'm really liking using aider with the deepseek API on my huge monorepo.

Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.

jvanderbot last Thursday at 3:38 PM

Likely, and I'm being blithe here, it's because of greater acceptance. If we try it on more difficult code, it'll fail in more difficult ways?

Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.

emsign yesterday at 7:14 AM

When coding assistants take longer, it's because they use more tokens, which is because AI companies are obligated to make more money.

radium3d last Thursday at 4:18 PM

The problem is everyone is using a different "level" of AI model. The experiences of those who can't afford, or choose not to pay for, advanced reasoning are far worse than those of people who can and do pay.

nhd98z last Thursday at 4:11 PM

This guy is using AI in the wrong way...

j45 yesterday at 6:10 PM

It feels like the more standardized the organization, or the more academic the background of an author, the more their insights lag behind the tip of the arrow.

It's clear AI coding assistants are able to help software developers at least in some ways.

Having a non-software-developer perspective speak about it is one thing, but it should be mindful that there are experienced folks, too, for whom the technology appears to be a jetpack.

If it didn't work for you, it just means there's more to learn.

PunchTornado yesterday at 10:18 AM

ChatGPT is getting worse and is a useless model. Surprised that people are still using it. The article tests only this model.

renarl last Thursday at 4:03 PM

Strange that the article talks about ChatGPT 4 and 5 but not the latest 5.2 model.

show 1 reply
