[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and only once per day, so a lot of the movement in accuracy could be attributed to that alone. I would run on 300 tasks, run the test suite 5 or 10 times per day, and average the score. Lots of variance can come from random factors, even Anthropic's servers being overloaded.
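To make the variance point concrete, here's a rough back-of-the-envelope sketch (it assumes independent Bernoulli trials, which repeated runs of the same task won't fully satisfy, and the 70% pass rate is made up):

```python
import math

def pass_rate_stderr(p: float, n_tasks: int, n_runs: int = 1) -> float:
    """Standard error of a mean pass rate, treating each task result
    as a Bernoulli(p) draw and averaging over n_runs independent runs."""
    return math.sqrt(p * (1 - p) / (n_tasks * n_runs))

# Hypothetical: a ~70% pass rate measured once on 50 tasks,
# vs. averaged over 10 runs on 300 tasks.
se_small = pass_rate_stderr(0.70, 50)        # one run, 50 tasks
se_large = pass_rate_stderr(0.70, 300, 10)   # 10 runs, 300 tasks
print(f"±{1.96 * se_small:.1%} vs ±{1.96 * se_large:.1%}")  # → ±12.7% vs ±1.6%
```

A single daily run on 50 tasks has a 95% band of roughly ±13 points, which is larger than any of the day-to-day swings people are reacting to.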
Why I do not believe this shows Anthropic serves folks a worse model:
1. The percentage drop is too small, and it oscillates: it goes up and down.
2. A baseline with Sonnet 4.5 (the obvious fallback for when the GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if it were, we would likely see a much sharper decline on certain days/periods: the graph would look dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing them; Claude Code asks you for feedback about the session. B) Claude Code itself gets updated, so the exact tool versions the agent can use change. Part of it is simply the natural variability of token sampling, which makes runs non-deterministic and non-equivalent (sometimes the model makes suboptimal decisions compared to T=0), but that is the price to pay for variability.
> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.
They're going to need to provide a lot more detail on their methodology, because that doesn't make a lot of sense. From their graphs, they seem to be calculating the confidence interval around the previous value, then determining whether the new value falls outside of it. But that's not valid for establishing the statistical significance of a difference. You need to calculate the confidence interval of the difference itself, and then see if all the values within that confidence interval remain positive (if it excludes 0). This is because both the old and new measurement have uncertainty. Their approach seems to be only considering uncertainty for one of them.
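For illustration, a minimal sketch of the correct comparison (a Wald interval on the difference of two proportions; the pass rates and sample sizes are made up):

```python
import math

def diff_ci(p1: float, n1: int, p2: float, n2: int, z: float = 1.96):
    """95% CI for the difference in two pass rates (Wald interval).
    Both measurements contribute uncertainty, so the variance terms add."""
    diff = p2 - p1
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

# Hypothetical: 70% on 50 tasks last period vs. 66% on 50 tasks this period.
lo, hi = diff_ci(0.70, 50, 0.66, 50)
# If the interval contains 0, the drop is not statistically significant.
```

With these numbers the interval is roughly (-0.22, +0.14), so a 4-point drop on 50 tasks is nowhere near significant once both measurements' uncertainty is accounted for.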
They should also really be more specific about the time periods. E.g. their graphs only show performance over the past 30 days, but presumably the monthly change is comparing the data from 60 to 31 days ago, to the data from 30 days ago until yesterday? In which case the weekly graph really ought to be displaying the past two months, not one month.
I’ve noticed Claude has been noticeably worse over the last week. For example, it told me I should pass frozen to make my Enum immutable—that’s not a thing. (It is a thing for dataclasses, but not for Enums.) That’s a pretty basic language feature it was nailing until recently. It also suggested I parse a URL using urlparse in a function that already uses urlparse. These are basic mistakes it wasn’t making before. Something seems to have changed, but I’m not sure what.
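For reference, the actual behavior: `frozen=` is a dataclass parameter, while Enum members are immutable out of the box with no such option.

```python
from dataclasses import dataclass, FrozenInstanceError
from enum import Enum

@dataclass(frozen=True)   # frozen= is a dataclass feature...
class Point:
    x: int

class Color(Enum):        # ...Enum accepts no such parameter;
    RED = 1               # members are already immutable.

p = Point(1)
try:
    p.x = 2
except FrozenInstanceError:
    print("dataclass: frozen=True blocks assignment")

try:
    Color.RED = 2         # reassigning a member raises AttributeError
except AttributeError:
    print("Enum: immutable by default, no frozen= needed")
```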
There was a moment about a week ago where Claude went down for about an hour. And right after it came back up it was clear a lot of people had given up and were not using it.
It was probably 3x faster than usual. I got more done in the next hour with it than I do in half a day usually. It was definitely a bit of a glimpse into a potential future of “what if these things weren’t resource constrained and could just fly”.
Simply search user prompts for curse words and then measure hostility sentiment. User hostility rises as agents fail to meet expectations.
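Half-joking, but the proxy is easy to sketch (the word list and scoring here are purely illustrative):

```python
# A crude hostility proxy over session prompts: fraction of prompts
# containing at least one flagged word. The word list is illustrative.
CURSES = {"damn", "wtf", "hell", "stupid", "useless"}

def hostility_score(prompts: list[str]) -> float:
    if not prompts:
        return 0.0
    hits = sum(any(w in p.lower().split() for w in CURSES) for p in prompts)
    return hits / len(prompts)

session = ["fix the bug", "wtf why did you delete the file", "ok thanks"]
print(hostility_score(session))
```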
Tracking benchmarks for AI-assisted coding tools is crucial. It helps developers understand the trade-offs and stability of the models they rely on.
Wouldn't be surprised if they slowly start quantizing their models over time. Makes it easier to scale and reduce operational cost. Also makes a new release have more impact as it will be more notably "better" than what you've been using the past couple of days/weeks.
I am using API mode, and it's clear that there are times when the Claude model just gives up. And it is very noticeable because the model just does the most dumb things possible.
"You have a bug in line 23." "Oh yes, this solution is bugged, let me delete the whole feature." A one-line fix that even ChatGPT 3.5 could make just doesn't happen. Workflows I use that are very reproducible start to flake and then fail.
After a certain number of tokens per day, it becomes unusable. I like Claude, but I don't understand why they would do this.
Lack of transparency as regards "thinking power"-consistency is a big gripe of mine with LLM providers. It's even worse with ChatGPT and the like. E.g. I had to learn the hard way that at >45k input tokens ChatGPT 5.2 Thinking Extended bumps its intelligence down so hard that it can't follow basic instructions (or it somehow truncates the input, losing the instructions). It sucks to lose confidence in an otherwise great tool. I would 100x prefer being forced to back-off, or getting a straight-no, than getting silently downgraded. Transparency is a big deal.
Benchmark tracking of cloud AI performance is going to be crucial going forward. Vendors are selling a service that by its nature is very difficult for customers to gauge day to day. How will I know if a code revision is ~2.5% less good today than it would have been yesterday? Or if queries during peak load hours use one less 'expert' in their MoE?
Yet vendor's costs to deliver these services are skyrocketing, competition is intense and their ability to subsidize with investor capital is going away. The pressure on vendors to reduce costs by dialing back performance a few percent or under-resourcing peak loads will be overwhelming. And I'm just a hobbyist now. If I was an org with dozens or hundreds of devs I'd want credible ways to verify the QoS and minimum service levels I'm paying for are being fulfilled long after a vendor has won the contract.
FYI the MarginLab Claude Code degradation tracker is showing a statistically significant ~4% drop in SWE-Bench-Pro accuracy over the past month
I really like the idea, but a "±14.0% significance threshold" is meaningless here.
The larger monthly scale should be the default, or you should get more samples.
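Assuming the tracker really does use ~50 tasks (as suggested upthread), that ±14% falls straight out of the binomial math:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% Wald interval for a Bernoulli pass rate."""
    return z * math.sqrt(p * (1 - p) / n)

# With 50 tasks and p near 0.5 the band is ~±13.9%, i.e. roughly the
# ±14% threshold shown. Pooling ~a month of daily runs shrinks it a lot.
print(f"{ci_half_width(0.5, 50):.1%}")    # n = 50 tasks, one run
print(f"{ci_half_width(0.5, 1500):.1%}")  # ~30 daily runs pooled
```

So the threshold isn't arbitrary, it's just what 50 samples buys you, which is exactly why more samples (or the monthly view) should be the default.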
This is super important, even if it's not currently the best measure of degradation yet. Anecdotally, Opus 4.5 has gotten so bad for me it's almost adding time to my workflow instead of saving it. It'd be nice to have more third-party measurements like this to hold Anthropic accountable.
New to me, but I am starting to infer that for those "in the know" it is common knowledge on HN that LLMs are purposely degraded over time to manage capacity/cost or fudge benchmarks...
How do you actually use these in production pipelines in practice then?
Are LLMs even well suited for some of the document parsing / data scrubbing automation people are throwing at them now?
Please try to make this statistically rigorous. There's lots of advice in this thread (intraday variation, etc.), but if I'm reading this right, it looks like the CI includes the baseline value, yet you still label this as failing.
Wouldn't this just be "our test isn't powerful enough to find a signal if there were one here?"
People will see this and derive strong conclusions that the data don't support and you, `qwesr123`, or "JB" from your blogs, will be responsible.
I'd love to see, given the level of non-determinism in benchmark performance, how many times you need to run the benchmark for a change to be relevant (or statistically significant, if you prefer).
That would be a nice paper.
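As a starting point, the standard two-proportion sample-size formula gives a rough answer (all numbers hypothetical, and it treats trials as independent, which repeated runs of the same tasks aren't quite):

```python
import math

def runs_needed(p: float, delta: float, n_tasks: int,
                z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Rough two-proportion sample-size calculation: how many benchmark
    runs of n_tasks each are needed to detect a drop of `delta` in pass
    rate with ~95% confidence and ~80% power."""
    q = p - delta
    n_trials = ((z_alpha + z_beta) ** 2 *
                (p * (1 - p) + q * (1 - q))) / delta ** 2
    return math.ceil(n_trials / n_tasks)

# Hypothetical: detecting a 4-point drop from a 70% pass rate, 50 tasks/run.
print(runs_needed(0.70, 0.04, 50))  # → 43
```

Dozens of full runs to reliably see a 4-point drop: which is why one run per day on 50 tasks can't support the claims being made.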
What would be cool is if this could somehow do a comparison by provider. E.g., during the last outages, Anthropic models running on Vertex were apparently less affected than those deployed elsewhere. (Not saying one is better than the other, but it would be a neat readout.)
Could this be (partially?) explained by Model Collapse [1], i.e. iteratively training on data that includes an ever increasing amount of AI slop?
[1] https://thebullshitmachines.com/lesson-16-the-first-step-fal...
I hope the author sees this:
You have to test intra-day variation. Many have noticed a sudden drop-off at certain times of day.
Does this use a claude subscription or key, and has the account been used for anything else that day?
On HN a few days ago there was a post suggesting that Claude gets dumber throughout the day: https://bertolami.com/index.php?engine=blog&content=posts&de...
Does it benchmark the underlying model (Opus 4.5) or the Claude Code harness? If the latter, I would love to see CC versions included.
I would be curious to see how it fares against a constant harness.
There were threads claiming that Claude Code got worse with 2.0.76, with some people going back to 2.0.62. https://github.com/anthropics/claude-code/issues/16157
So it would be wonderful to measure these.
This strategy seems inspired by TikTok's approach for retaining new uploaders.
TikTok used to give new uploaders a visibility boost (i.e., an inflated number of likes and comments) on their first couple of uploads, to get them hooked on the service.
In Anthropic/Claude's case, the strategy is (allegedly) to give new users access to the premium models on sign-up, and then increasingly cut the product with output from cheaper models.
What makes the level they chose a “baseline,” against which it would be appropriate to do statistical tests?
First off, this is a cool project, look forward to some interesting insights.
I would suggest adding a clarification noting that the longer measures, like the 30-day pass rate, are raw data only, while the "statistically significant" labels apply only to changes.
Maybe something like: "Includes all trials; significance labels apply only to confidence in the change vs. baseline."
Codex is doing better. Why is everyone silent on Codex? https://marginlab.ai/trackers/codex/
Very interesting. I would be curious to understand how granular these updates are being applied to CC + what might be causing things like this. I feel like I can notice a very small degradation but have compensated with more detailed prompts (which I think, perhaps naively, is offsetting this issue).
I KNEW I WASN'T CRAZY
They should run their test against a control baseline, such as an open-source hosted model, to see the overall drift in the test itself.
> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.
Doesn't really work like that. I'd remove the "statistically significant" labelling because it's misleading.
I have yet to experience any degradation in the coding tasks I use to evaluate Opus 4.5, but I did see a rather strange and reproducible worsening in prompt adherence on non-coding tasks since the third week of January.
Very simple queries, even ones easily answered via a regular web search, have begun to consistently yield inaccurate results with Opus 4.5, despite the same prompts previously producing accurate ones.
One of the tasks I had thought was fully saturated, since most recent releases had no issue solving it, was to request a list of material combinations for fabrics used in bag construction that utilise a specific fabric base. In the last two weeks, Claude has consistently and reproducibly provided results that deviate from the requested fabric base, making them inaccurate in a way that a person less familiar with the topic might not notice instantly. There are other queries of this type, on topics I'm nerdily familiar with to a sufficient degree to notice such deviations (motorcycle-history queries, for example), so I can say this behaviour isn't limited to fabrics and bag construction.
Looking at the reasoning traces, Opus 4.5 even writes down the correct information, yet somehow provides an incorrect final output anyways.
What makes this so annoying is that in coding tasks, with extensive prompts that require far greater adherence to very specific requirements in a complex code base, Opus 4.5 does not show such a regression.
I can only speculate about what may lead to such an experience, but on non-coding tasks I have seen regression in Opus 4.5, whereas on coding tasks I have not. I'm not saying there is none, but I wanted to point it out, as such discussions are often focused primarily on coding, where I find it can be easier to see potential regressions where there are none, since tasks become inherently more complex as a project goes on.
My coding benchmarks are a series of very specific prompts modifying a few existing code bases in rather obscure ways, with which I regularly check whether a model deviates severely from what I'd seen previously. Each run starts with a fresh code base and some fairly simple tasks, then gets increasingly complex, with the later prompts not yet solved by any LLM I have tested.

Partly this originated from my subjective experience with LLMs early on: I found a lot of things worked very well, but then, as the project went on and I tried more involved things the model struggled with, I felt the model was overall worse, when in reality what had changed were simply the requirements and task complexity as the project grew and the easier tasks had already been completed.

In this type of testing, Opus 4.5 this week got as far as, and produced results as good as, it did in December. Of course, past regressions were limited to specific users, so I am not saying no one is experiencing reproducible regressions in code output quality, merely that I cannot reproduce them in my specific suite.
It would be interesting to see what scores it gets when Anthropic's status page actually reports degradation. It gets degraded pretty often, so there would at least be something to compare, or a way to learn at what point Anthropic declares degradation.
Would love to see this idea expanded to every alleged SoTA model currently in production. Any speculation as to why this degradation occurs?
The chart would benefit from having weekends highlighted, or from a second chart averaged by weekday.
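For instance, a toy sketch of the weekday averaging (the dates and pass rates are made up):

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Toy daily pass-rate series (date -> accuracy); real data would come
# from the tracker. Averaging by weekday would surface weekend effects.
daily = {date(2026, 1, d): acc for d, acc in
         [(5, 0.71), (6, 0.70), (10, 0.64), (11, 0.63), (12, 0.72)]}

by_weekday = defaultdict(list)
for day, acc in daily.items():
    by_weekday[day.strftime("%A")].append(acc)

for wd, accs in by_weekday.items():
    print(wd, round(mean(accs), 3))
```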
In medicine there is a concept of reporting adverse effects of medications or interventions, which are then collectively studied for public health [MedWatch][VAERS][EudraVigilance] and in academia. We should have something like that for all coding agents (and agents in other fields too), given how widely they're deployed and their effect on "health" in general (not only human). Call it the AI "health of things" benchmark.
I imagine a hybrid of the qualities of volunteer efforts like Wikipedia, novel problem sets like Advent of Code, and benchmarks like this one. The goal? To collectively study the effects of usage across the many areas where AI is deployed.
[MedWatch](https://www.fda.gov/safety/medwatch-fda-safety-information-a...)
[VAERS](https://www.cdc.gov/vaccine-safety-systems/vaers/index.html)
[EudraVigilance](https://www.ema.europa.eu/en/human-regulatory-overview/resea...)
My personal conspiracy theory is that they choose who to serve a degraded model to based on social graph analysis and sentiment analysis, maximizing for persuasion while minimizing compute.
I’m sure there is not enough data here for this to be statistically significant (it seems to oscillate too much and not show real trends or step changes) - BUT
If this measure were hardened up a little, it would be really useful.
It feels like an analogue to an employee’s performance over time - you could see in the graphs when Claude is “sick” or “hungover”, when Claude picks up a new side hustle and starts completely phoning it in, or when it’s gunning for a promotion and trying extra hard (significant parameter changes). Pretty neat.
Obviously the anthropomorphising is not real, but it is cool to think of the model’s performance as being a fluid thing you have to work with, and that can be measured like this.
I’m sure some people, most, would prefer that the model’s performance were fixed over time. But come on, this is way more fun.
Finally someone did it! We need this for all models.
It would be great if there were RSS support.
This is why I run my own models. All the inference providers do sneaky things behind the scenes. They will limit the output tokens, turn off attention layers, lower reasoning, or just use a completely different model. I'm actually surprised that Claude Code experienced this, as I've experienced this the least from API and coding agents.
Pretty sure someone at Google, OpenAI, and Anthropic met up at a park, leaving their phones in their car, and had a conversation that January 2026, they were all going to silently degrade their models.
They were fighting an arms race that was getting incredibly expensive and realized they could get away with spending less electricity and there was nothing the general population could do about it.
Grok/Elon was left out of this because he would leak this idea at 3am after a binge.
Any chance we can get something like this for Codex CLI? It'd be cool to compare.
Call it what you will. But the experience is like you have a reliable coworker, but he randomly decides to take bong hits.
"No no yeah bro no I'm good like really the work's done and all yeah sorry I missed that let me fix it"
I wonder: when I experience noticeably degraded model quality (i.e., Opus), is it because my usage falls into the highest buckets and I'm being shadow-limited or served worse versions of Opus, or is it because of actual server load?
It wouldn't be the first time companies have run secret shadow algorithms to optimize things, and wouldn't it be the obvious move to limit power users as a matter of cost/profit and not tell them? (See the history of the "shadow ban", though that exists for different reasons.)
This is probably entirely down to subtle changes to CC prompts/tools.
I've been using CC more or less 8 hrs/day for the past 2 weeks, and if anything it feels like CC is getting better and better at actual tasks.
Edit: Before you downvote, can you explain how the model could degrade WITHOUT changes to the prompts? Is your hypothesis that Opus 4.5, a huge static model, is somehow changing? Master system prompt changing? Safety filters changing?
Hi everyone, Thariq from the Claude Code team here.
Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.
Run `claude update` to make sure you're on the latest version.