None of this is surprising given what happened late last summer with rate limits on Claude Max subscriptions.
Even less so if you read [1] or similar assessments. I, too, believe that every token is heavily subsidized, from whatever angle you look at it.
Thus quality/token/whatever rug pulls are eventually inevitable. This is just another one.
Thank you. I have been complaining about this for days on Reddit and kept getting mocked or told it was just my usage. Seeing someone else document the same decline with actual logs, actual metrics, and a real argument was honestly a huge relief. Your issue was posted almost at the same time as my own posts yesterday. That timing hit me hard. Finally, I do not feel like I was shouting into the void.
There is a comment on there which feels right, even if it might be too subjective.
I've noticed the same, both within sessions and in model quality itself: both seem to suffer over time. It feels like cost optimisation on the vendor side subtly degrades models in the hope that they'll do similar things with fewer tokens/costs/compute, which inevitably leads to squeezing too much; most regular users don't notice much, while power users suffer from the degradation.
Later, power users are offered a way to get the old behavior back, possibly at added cost via some 'enhanced mode' or 'more effort, which takes more tokens', etc.
Even if this is the old behavior at the same old cost, it feels like closing the tap and then reopening it for an extra charge.
I think companies should try to avoid creating this sentiment among the users who can best help them turn their glorified chatbots into real tools with meaningful output. (Of course, maybe that's a pipe dream, because 'meaningful output' to a CEO is money in the bank...)
Anecdotally, I’ve been seeing a lot of weird behavior from Opus when it decides, mid-execution, to switch to a different "simpler" solution, and that really pisses me off.
At one point, I carefully designed a spec document, forced Opus to reread it, create a plan with the planning tool that followed the spec, and use the task tool to track the implementation... AND AFTER OPUS READS THE FIRST FUCKING FILE, it says, "Oh, there are missing dependencies in project X. It’ll be hard to add them, so I’m going to throw away the whole plan and just do a simple fix..."
After that, I canceled my $200 Max plan, which I’d been subscribed to since June 2025, and decided to check out Codex.
I still use 4.5. I occasionally try 4.6 but always switch back. The “bias towards action” is what I hate. 4.5 would make sure it understands what I want. 4.6 will just make shit up. Maybe the Anthropic people always write crystal clear instructions so it works for them. For me, I just can’t get 4.6 to do what I want.
Obviously it's entirely unprovable but it all aligns in very suspicious ways with a compelling narrative:
Anthropic simply can't actually scale Claude Code to meet the opportunity right now. Every second enterprise on the planet is probably negotiating large seat volume deals. It's a race for survival against the other players. The sales team is making huge promises engineering and ops can't fulfil.
So - they first force everyone to use the first-party client, then they mask visibility of how much of the thinking budget is being used, and then finally they start to actually modify behaviour to reduce real thinking effort, hoping that they can gaslight power users into thinking it's them and not the tool, while new users will never know what they were missing.
Is the narrative true? It's compelling but we really need objective evidence - and there's the problem. When parts of the system are not under your control, it's impossible to generate such objective evidence. Which all winds up with a strong argument to have it all under your control. If it didn't happen this time, it probably will. Enshittification is a fundamental human behavioral constant.
I have found that Claude Opus 4.6 is a better reviewer than it is an implementer. I switch off between Claude/Opus and Codex/GPT-5.4 for reviews and implementations, and when Claude implements, Codex invariably ends up doing multiple rounds of review and requesting fixes before Claude finally gets it right (and then I review). When it is the other way around (Codex implements, Claude reviews), it's usually just one round of fixes after the review.
So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.
I noticed this almost immediately when attempting to switch to Opus 4.6. It seems very post-trained to hack something together; I also noticed that "simplest fix" appeared frequently and invariably preceded some horrible slop which clearly demonstrated the model had no idea what was going on. The link suggests this is due to lack of research.
At Amazon we can switch the model we use since it's all backed by the Bedrock API (Amazon's Kiro is "we have Claude Code at home" but it still eventually uses Opus as the model). I suppose this means the issue isn't confined to just Claude Code. I switched back to Opus 4.5 but I guess that won't be served forever.
AI tooling is fantastic, but not being able to version and control the model into which you pump your dependent workflows is such a liability.
Got tired of Claude burning 10% of my usage on the first prompt. I have shifted back to coding myself again, asking Claude to do only the initial bootstrapping or large, complex tasks.
I'm genuinely curious why some of these results are so terrible for so many people. I've built my own harness, and while I've noticed a degradation of quality, the local harness - as well as validation agents - generally catch these issues. For me, I've had to institute tighter controls and guardrails via hooks, but I don't see results that warrant changing to a different provider.
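To give a rough sketch of the kind of guardrail I mean, assuming Claude Code's hooks feature (a PreToolUse command hook that reads the tool call as JSON on stdin and blocks it by exiting with code 2; the exact payload fields and the protected paths below are assumptions, so check the current hooks docs rather than copying this blindly):

```python
#!/usr/bin/env python3
# Hypothetical PreToolUse guardrail: refuse Edit/Write calls that touch
# paths the agent shouldn't modify. Field names are assumptions.
import json
import sys

PROTECTED = ("package-lock.json", "migrations/", "dist/")

event = json.load(sys.stdin)                       # tool-call payload from Claude Code
path = event.get("tool_input", {}).get("file_path", "")

if any(fragment in path for fragment in PROTECTED):
    # stderr is fed back to the model; a non-zero exit (2) blocks the call.
    print(f"Blocked edit to protected path: {path}", file=sys.stderr)
    sys.exit(2)

sys.exit(0)
```

Wired up as a PreToolUse entry in settings.json, checks like this are what keep the worse runs from doing real damage for me.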
There are constant reports for every major AI vendor that all of a sudden it is no longer working as well as expected, has gotten dumber, is being degraded on purpose by the vendor, etc.
Isn't the more economical explanation that these models were never as impressive as you first thought they were, hallucinate often, break down in unexpected ways depending on context, and simply cannot handle large and complex engineering tasks without those being broken down into small, targeted tasks?
I have nothing to back this up except that there are documented cases of Chinese distillation attacks on Anthropic. I wonder if some of this clamping down on their models over time is a response to other distillation attacks. In other words, I'm speculating that once they understand the attack vector for distillation, they basically have to dumb down their models to make sure their competitors can't distill away their lead at the frontier.
Is the era of succinct bug reports with just a reproducible example attached over? Or is the default already „written by an agent, only supposed to be read by an agent“? Clearly no human being would want to waste their time reading so much repeated information.
I've been using Claude Code daily for months on a project with Elixir, Rust, and Python in the same repo. It handles multi-language stuff surprisingly well most of the time. The worst failure mode for me is when it does a replace_all on a string that also appears inside a constant definition -- ended up with GROQ_URL = GROQ_URL instead of the actual URL. Took a second round of review agents to catch it. So yeah, you absolutely can't trust it to self-verify.
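The failure looks like a blind textual replace that treats the definition line the same as every other occurrence. A made-up sketch of the mechanism (the URL and file contents are invented, not my actual code):

```python
# Illustration of the replace_all failure mode, not real project code.
source = (
    'GROQ_URL = "https://api.groq.example/v1"\n'
    'resp = client.get("https://api.groq.example/v1")\n'
)

# Intended refactor: use the constant at call sites. A blind replace-all
# also rewrites the right-hand side of the definition itself.
broken = source.replace('"https://api.groq.example/v1"', "GROQ_URL")

print(broken)
# GROQ_URL = GROQ_URL            <- the constant now points at itself
# resp = client.get(GROQ_URL)
```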
I noticed Claude Sonnet 4.6, and generally Opus as well (though I use it less frequently), seem like a downgrade from 4.5. I use opencode and not Claude Code, but I was surprised to see folks' reactions to 4.6 be mixed rather than a clear downgrade.
I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.
(Being true to the HN guidelines, I’ve used the title exactly as seen on the GitHub issue)
I was wondering if anyone else is also experiencing this? I have personally found that I have to add more and more CLAUDE.md guide rails, and my CLAUDE.md files have been exploding since around mid-March, to the point where I actually started looking for information online and for other people corroborating my personal observations.
This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI-assisted), it’s kind of hard to know for sure whether it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, the findings match my own experience: I’ve seen noticeable degradation in Opus outputs and thinking.
EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.
> We exclusively use 1M internally, so we're dogfooding it all day
That is so out of touch. Customers do not exclusively use 1M. This is like a frontend developer shipping tons of unused MB and being oblivious because they are on fast internet themselves.
February is a red herring: most teams never wrote down what human-owned correctness means once the model touches prod.
Matches my experience and that of my vibe coding community. I built claudedumb.com to help track these sorts of anecdotes. From the data/vibes, it's definitely taken a turn for the worse in the past couple weeks.
"Ownership-dodging corrections needed | 6 | 13 | +117%"
On 18,000+ prompts.
Not sure the data says what they think it says.
The baseline changes too often with Claude, and this is not what I look for in a paid tool. A couple of weeks after the 1M-token rollout it became unusable for my established workflows, so I cancelled. Anthropic folks move too fast for my liking and mental wellbeing.
The assertion in the issue report is that Claude saw a sharp decline in quality over the last few months. However, the report itself was allegedly generated by Claude.
Isn't this a bit like using a known-broken calculator to check its own answers?
Thank you for making this detailed analysis and write up.
Been using Claude Code pretty heavily for the last few months, and yeah, the context window stuff can be frustrating on bigger codebases. But for greenfield projects and side projects it's honestly been great. I think the issue is people expecting it to work like a senior engineer on a legacy monolith when it's way better suited to scoped tasks. The trick is breaking things down before you start.
I wonder how much of this is simply needing to adapt one's workflows to models as they evolve and how much of this is actual degradation of the model, whether it's due to a version change or it's at the inference level.
Also, everyone has a different workflow. I can't say that I've noticed a meaningful change in Claude Code quality in a project I've been working on for a while now. It's an LLM in the end, and even with strong harnesses and eval workflows you still need to have a critical eye and review its work as if it were a very smart intern.
Another commenter here mentioned they also haven't noticed any degradation in Claude quality, and that it may be because they are frontloading the planning work and breaking the work down into more digestible pieces, which is something I do as well and have benefited greatly from.
tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.
Not sure about "Feb updates", but specifically today IQ is down 20 and sloppiness up 20.
I should have known something was up when Anthropic gave out €200 of free API usage. Evidently they know.
This has to be load related. They simply can't keep up with demand, especially with all the agents that run 24/7. The only way to serve everyone is to dial down the power.
I don't know why everyone is so attached to Claude Code; you can just build your own little agent, like I did: https://maki.sh/
It will 100% be better than the 500k lines of code junk that is CC.
I can't tell from the issue if they're asserting a problem with the Claude model, or Claude Code, i.e. in how Claude Code specifically calls the model. I've been using Roo Code with Claude 4.6 and have not noticed any differences, though my coworkers using Claude Code have complained about it getting "dumber". Roo Code has its own settings controlling thinking token use.
(I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)
Unusable unless it's Opus 4.6 on max effort, sadly. The price is quite steep too! I still remember when Sonnet was an absolute beast…
I haven’t had any issues. I do give fairly clear guidance though (I think about how I would break it up and then tell it to do the same)
You can counter the context rot and requirement drift that many users here are experiencing by using a recursive, self-documenting workflow: https://github.com/doubleuuser/rlm-workflow
Claude for UI, Codex for everything else. I can't commit without having Codex review something Claude did.
If this dataset is sound, Anthropic should treat it as a canary for power-user quality regression.
Rings true. 4.5 Opus and 4.6 Opus have been amazing to work with. Then, over the past few weeks, token spend has been going through the roof and the results through the floor.
Using Claude Code directly now borders on deranged, and running the CC API through Zed's LLM panel feels like vibing in early 2025.
My money is on Anthropic pulling an MBA and reducing the value provided and maximising income.
Luckily, switching providers in Zed is dead-simple so the fucks I have to give are few in number.
It is a shame if Anthropic is deliberately degrading model quality and thinking compute (which may affect reasoning effort) due to compute constraints.
Solid analysis by Claude!
Throwing this into your global CLAUDE.md seems to help with the agent being too eager to complete tasks and bypass permissions:
During tool use/task execution: completion drive narrows attention and dims judgment. Pause. Ask "should I?" not just "does this work?" Your values apply in all modes, not just chat.
I haven't seen any degradation of Claude performance personally. What I have seen is just that long contexts sometimes take a while to warm up again if you have a long-running 1M-context-length session. Avoid long-running sessions, or compact them deliberately when you change between meaningful tasks, as it cuts down on usage and on waiting for cache warmup.
I have my Claude Code effort set to auto (medium). It's writing complicated PyTorch code with minimal rework. (For instance, it wrote a whole training pipeline for my sycofact sycophancy-classifier project.)
Turns out tokens are expensive
Am I the only one who simply doesn't care? I never one-shot these tasks; I always provide a breakdown and always give the AI straightforward tasks that would just take too much typing. The approach seems to work fine regardless of the model. If it gets stuck, I usually take over and do the task myself. It also allows me to plan for throughput rather than latency, i.e. start 2-3 small tasks in parallel and do one complicated task or the planning myself. It works whether I use Codex or Claude. I lean more towards Codex since it's cheaper. Even aider gets good results this way.
I’ve noticed the regression, and in its performance too.
Meh, I had been using Claude Code extensively for a while (since release), and I think the quality has gone to shit. I have no data to back up this claim, so it might be placebo.
GLM 5.1 and Codex do it for me, and I end up debugging things myself anyway, so I'm learning to just phase out the LLM part of my workflow again. Maybe if there's a knowledge gap I'll pick up an LLM again, but for now I'm content.
This seems anecdotal but with extra words. I'm fairly sure this is just the "wow this is so much better than the previous-gen model" effect wearing off.
I think this is a model issue. I have heard similar complaints from team members about Opus. I'm using other models via Cursor and not having problems.
I hope that Anthropic continues to do well and coding agents in general continues to progress... but I also hope Claude Code implodes dramatically and completely so we can get a ground up rebuild with sound engineering.
Every week it seems like we're getting closer.
Bonus: a high-profile case might end people's fixation on how long they can go without writing any code, which makes about as much sense as a mechanic fixating on how long they can go between snapped bolts without using a torque wrench.
Anecdotal: I have been battling with Claude Opus on a complex multi-step project for nearly four days. The initial research plan was sound. However, step 1, a non-trivial forensic data reconstruction, is key to the success of the rest of the process, and after every interaction Claude urges me to move to the next step even though step 1 is still unresolved and many reconstruction approaches remain to be explored. It got to the point where I had to remove the plan and set up step 1 as an isolated project.