Why I do not believe this shows Anthropic serves folks a worse model:
1. The percentage drop is too low and oscillating, it goes up and down.
2. The baseline of Sonnet 4.5 (the obvious choice for when they have GPU busy for the next training) should be established to see Opus at some point goes Sonnet level. This was not done but likely we would see a much sharp decline in certain days / periods. The graph would look like dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing, CC asks you feedbacks about the session. B) Claude Code itself gets updated, as the exact tools version the agent can use change. In part it is the natural variability due to the token sampling that makes runs not equivalent (sometimes it makes suboptimal decisions compared to T=0) other than not deterministic, but this is the price to pay to have some variability.
I too suspect the A/B testing is the prime suspect: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.
Either way, if true, given the cost I wish I could opt-out or it were more transparent.
Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.
All speculation though
4. The graph starts January 8.
Why January 8? Was that an outlier high point?
IIRC, Opus 4.5 was released late november.
It would be very easy for them to switch the various (compute) cost vs performance knobs down depending on load to maintain a certain latency; you would see oscillations like this, especially if the benchmark is not always run exactly at the same time every day.
& it would be easy for them to start with a very costly inference setup for a marketing / reputation boost, and slowly turn the knobs down (smaller model, more quantized model, less thinking time, fewer MoE experts, etc)
> 1. The percentage drop is too low and oscillating, it goes up and down.
How do you define “too low”, they make sure to communicate about the statistical significance of their measurements, what's the point if people can just claim it's “too low” based on personal vibes…
I believe the science, but I've been using it daily and it's been getting worse, noticeably.