This is gonna be game-changing for the next 2-4 weeks before they nerf the model.
Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.
Then a sacrificial Anthropic engineer will “discover” a couple of obscure bugs that “in some cases” might have led to less-than-optimal performance. Still largely a user skill issue though.
Then a couple months later they’ll release Opus 4.7 and go through the cycle again.
My allegiance to these companies is now measured in nerf cycles.
I’m a nerf cycle customer.
I've played around with Gemini 3 Pro in Cursor, and honestly, I find it significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.
I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage driving real, serious revenue, I have far better feelings about Anthropic going into 2026 than about any other foundation model. Excited to put Opus 4.5 through its paces.
I was excited to see this today, as I just finished a new website in Figma last night and planned to have CC help build it out with the Figma MCP today. CC started up with Opus 4.5 and... Not off to a good start. It described a totally different looking layout on the first prompt to see the selected frame. I asked why:
You're right to call that out. Looking back at what happened:
1. First get_design_context call - Nothing was selected, so it returned an error.
2. Second get_design_context call - It returned a tiny snippet of React code for a "Vector" component (just an image element), which gave me almost no context.
3. First get_screenshot call - The image returned was essentially blank/empty (the output showed <output_image> with what appears to be no actual visual content).
I then hallucinated an entire design description based on:
- The project context from CLAUDE.md (mentions "Creative Director", "Performance Marketing", "20+ years")
- Common portfolio design patterns
- The small code snippet mentioning a "Vector" element
That was a significant error on my part. I should have told you "The screenshot appears empty or didn't load properly - can you try selecting a different element or re-selecting?" instead of fabricating a description.

Tested this today by implementing a new low-frequency RFID protocol in the Flipper Zero codebase, based on a Proxmark3 implementation. Was able to do it in two hours by giving it a raw PSK recording alongside the reference code, plus some troubleshooting. This is the kind of task the last generation of frontier models was incapable of doing. Super stoked to use this :)
The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks.
There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded but then hides that info from the user.
The risks are a bit scary, especially around CBRNs. Opus is still only ASL-3 (systems that substantially increase the risk of catastrophic misuse) and not quite at ASL-4 (uplifting a second-tier state-level bioweapons programme to the sophistication and success of a first-tier one), so I think we're fine...
I've never written a blog post about a model release before but decided to this time [1]. The system card has quite a few surprises, so I've highlighted some bits that stood out to me (and Claude, ChatGPT and Gemini).
[0] https://www.anthropic.com/claude-opus-4-5-system-card
[1] https://dave.engineer/blog/2025/11/claude-opus-4.5-system-ca...
Seeing these benchmarks makes me so happy.
Not because I love Anthropic (I do like them), but because it staves off having to change my Coding Agent.
This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.
I've been holding onto Claude Code for the last little while since I've built up a robust set of habits, slash commands, and subagents that help me squeeze as much out of the platform as possible.
But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.
Thankfully Anthropic has come out swinging today, and my own SOPs can remain intact a little while longer.
Notes and two pelicans: https://simonwillison.net/2025/Nov/24/claude-opus/
A really great way to get an idea of the relative cost and performance of these models at their various thinking budgets is to look at the ARC-AGI-2 leaderboard. Opus 4.5 stacks up very well here when you compare it to Gemini 3’s score and cost. Gemini 3 Deep Think is still the current leader, but at more than 30x the cost.
The cost curve of achieving these scores is coming down rapidly. In Dec 2024, when OpenAI announced beating human performance on ARC-AGI-1, they spent more than $3k per task. Now you can get the same performance for pennies to dollars per task, approximately an 80x reduction in 11 months.
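Quick sanity check on that ratio, using only the figures above (the per-task dollar figure and halving time are my own inference from them):

    import math

    # Back-of-envelope from the quoted numbers: >$3k/task in Dec 2024, ~80x cheaper now.
    cost_dec_2024 = 3000.0                       # USD per ARC-AGI-1 task (quoted)
    reduction = 80.0                             # claimed reduction factor (quoted)
    print(f"implied cost today: ~${cost_dec_2024 / reduction:.2f}/task")  # ~$37.50
    print(f"halving time: ~{11 / math.log2(reduction):.1f} months")       # ~1.7 months

In other words, costs at this capability level have been halving roughly every seven weeks.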
Did anyone else notice Sonnet 4.5 being much dumber recently? I tried it today and it was really struggling with some very simple CSS on a 100-line self-contained HTML page. This never used to happen before, and now I'm wondering if this release has something to do with it.
On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.
EDIT: Apparently it's not available in Claude Code with the Pro subscription, but you can add funds to your Claude wallet and use Opus pay-as-you-go. It's going to be really nice to use Opus for planning and Sonnet for implementation with the Pro subscription.
However, I noticed that the previously available option of "use Opus for planning and Sonnet for implementation" isn't there in Claude Code with this setup any more. Hopefully they'll implement it soon, as that would be the best of both worlds.
EDIT 2: Apparently you can use `/model opusplan` to get Opus in planning mode. However, it says "Uses your extra balance", and it's not clear whether it means it uses the balance just in planning mode, or also in execution mode. I don't want it to use my balance when I've got a subscription, I'll have to try it and see.
EDIT 3: It looks like Sonnet also consumes credits in this mode. I had it make some simple CSS changes to a single HTML file with Opusplan, and it cost me $0.95 (way too much, in my opinion). I'll try manually switching between Opus for the plan and regular Sonnet for the next test.
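For that manual-switch test, the flow I have in mind looks like this (the exact aliases are whatever `/model` lists on your install; whether plain Sonnet then stays on the subscription is precisely what I'm checking):

    /model opus      # pay-as-you-go Opus: shift+tab into plan mode, write the plan
    # ...review and accept the plan...
    /model sonnet    # back to subscription Sonnet for the actual implementation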
On my Max plan, Opus 4.5 is now the default model! Until now I used Sonnet 4.5 exclusively and never used Opus, even for planning - I'm shocked that this is so cheap (for them) that it can be the default now. I'm curious what this will mean for the daily/weekly limits.
A short run at a small toy app makes me feel like Opus 4.5 is a bit slower than Sonnet 4.5 was, but that could also just be the day-one load it's presumably under. I don't think Sonnet was holding me back much, but it's far too early to tell.
> Pricing is now $5/$25 per million [input/output] tokens
So it’s 1/3 the price of Opus 4.1…
> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens
…and potentially uses a lot fewer tokens?
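Rough output-cost math on those two quotes combined (the Sonnet 4.5 output price of $15/MTok is my assumption from its current list pricing, not from this post):

    # Same SWE-bench score; 76% fewer output tokens; $25 vs $15 per MTok out.
    sonnet_out = 15.0        # USD/MTok for Sonnet 4.5 output (my assumption)
    opus45_out = 25.0        # USD/MTok, from the announcement
    token_ratio = 1 - 0.76   # Opus 4.5 output tokens relative to Sonnet (quoted)
    print(f"relative output cost: {opus45_out * token_ratio / sonnet_out:.2f}x")  # 0.40x

So for that benchmark at least, the "more expensive" model is actually the cheaper way to buy the same score.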
Excited to stress test this in Claude Code, looks like a great model on paper!
I used Gemini instead of my usual Claude for a non-trivial front-end project [1] and it really just hit it out of the park especially after the update last week, no trouble just directly emitting around 95% of the application. Now Claude is back! The pace of releases and competition seems to be heating up more lately, and there is absolutely no switching cost. It's going to be interesting to see if and how the frontier model vendors create a moat or if the coding CLIs/models will forever remain a commodity.
I'm on a Claude Code Max subscription. The last few days have been a struggle with Sonnet 4.5. Now it's switched to Claude Opus 4.5 as the default model. Ridiculously good and fast.
Ok, the Victorian lock puzzle game is a pretty damn cool way to showcase the capabilities of these models. I kinda want to start building similar puzzle games for models to solve.
I wish it was open-weights so we could discuss the architectural changes. This model is about twice as fast as 4.1, ~60 t/s vs ~30 t/s. Is it half the parameters, or a new INT4 linear sparse-MoE architecture?
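If decode is memory-bandwidth-bound, tokens/s is roughly bandwidth divided by the bytes of active parameters streamed per token, so halving either the active params or the bits-per-param would about double throughput. Toy model of that reasoning (every concrete number is an illustrative guess, not Anthropic's specs):

    # Toy bandwidth-bound decode model: tok/s ~= bandwidth / active-param bytes per token.
    def decode_tps(active_params_b: float, bits: float,
                   bandwidth_gb_s: float = 3350.0) -> float:  # e.g. one H100's HBM3
        bytes_per_token = active_params_b * 1e9 * bits / 8
        return bandwidth_gb_s * 1e9 / bytes_per_token

    print(decode_tps(100, 8))  # hypothetical 100B active params at INT8 -> ~34 tok/s
    print(decode_tps(50, 8))   # half the active params                  -> ~67 tok/s
    print(decode_tps(100, 4))  # same active params at INT4              -> ~67 tok/s

Both routes land in the same place, which is why speed alone can't distinguish them.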
We've added support for opus 4.5 to v0 and users are making some pretty impressive 1-shots:
https://x.com/mikegonz/status/1993045002306699704
https://x.com/MirAI_Newz/status/1993047036766396852
https://x.com/rauchg/status/1993054732781490412
It seems especially good at threejs / 3D websites. Gemini was similarly good at them (https://x.com/aymericrabot/status/1991613284106269192); maybe the model labs are focusing on this style of generation more now.
Would love to know what's going on with C++ and PHP benchmarks. No meaningful gain over Opus 4.1 for either, and Sonnet still seems to outperform Opus on PHP.
Does anyone here understand "interleaved scratchpads" mentioned at the very bottom of the footnotes:
> All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), and default sampling settings (temperature, top_p).
I understand scratchpads (e.g. [0] Show Your Work: Scratchpads for Intermediate Computation with Language Models) but not sure about the "interleaved" part, a quick Kagi search did not lead to anything relevant other than Claude itself :)
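My best guess is that it refers to Anthropic's "interleaved thinking" beta, where the model can emit thinking blocks between tool calls (a scratchpad interleaved with the tool loop) rather than only once up front. If so, the eval harness would look roughly like this; the model id and the tool are placeholders, and the beta header is the one documented for interleaved thinking:

    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-5",  # placeholder model id
        max_tokens=16000,
        # 64K budget as in the footnote; under the interleaved beta the budget
        # spans the whole tool-use turn rather than a single response.
        thinking={"type": "enabled", "budget_tokens": 64000},
        tools=[{
            "name": "run_tests",  # hypothetical tool for illustration
            "description": "Run the project's test suite",
            "input_schema": {"type": "object", "properties": {}},
        }],
        extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
        messages=[{"role": "user", "content": "Fix the failing test."}],
    )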
Why do they always cut off 70% of the y-axis? Sure it exaggerates the differences, but... it exaggerates the differences.
And they left Haiku out of most of the comparisons! That's the most interesting model for me. Because for some tasks it's fine. And it's still not clear to me which ones those are.
Because in my experience, Haiku sits at this weird middle point where, if you have a well defined task, you can use a smaller/faster/cheaper model than Haiku, and if you don't, then you need to reach for a bigger/slower/costlier model than Haiku.
Does anyone know, or have a guess about, the size of these latest thinking models and what hardware they use to run inference? As in, how much memory and what quantization they use, and whether it's "theoretically" possible to run one on something like a Mac Studio M3 Ultra with 512GB RAM. Just curious from a theoretical perspective.
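The weight-memory math at least is straightforward; the parameter counts below are pure speculation, since Anthropic publishes none:

    # Do the weights of a hypothetical model fit in 512 GB of unified memory?
    def weights_gb(params_b: float, bits: float) -> float:
        return params_b * 1e9 * bits / 8 / 1e9

    for params_b in (400, 1000, 2000):  # hypothetical totals, in billions
        for bits in (16, 8, 4):
            gb = weights_gb(params_b, bits)
            verdict = "fits" if gb < 512 else "doesn't fit"
            print(f"{params_b}B @ {bits}-bit: {gb:,.0f} GB ({verdict})")

And that's weights only; KV cache for a 200K context plus the Mac's roughly 800 GB/s memory bandwidth would be the real constraints even where the weights fit.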
Great seeing the price reduction. Opus was historically priced at $15/$75; this one delivers at $5/$25, which is close to Gemini 3 Pro. I hope Anthropic can afford to increase the limits for the new Opus.
Oh boy, if the benchmarks are this good and Opus feels like it usually does then this is insane.
I’ve always found Opus significantly better than the benchmarks suggested.
LFG
One thing I didn't see mentioned is raw token gen speed compared to the alternatives. I am using Haiku 4.5 because it is cheap (and so am I) but also because it is fast. Speed is pretty high up in my list of coding assistant features and I wish it was more prominent in release info.
great, paying $100/mo for claude code, this stops me from switching to gemini 3.0 for now.
What causes the improvements in new AI models recently? Is it just more training, or is it new, innovative techniques?
Some early visual evaluations: https://x.com/mutewinter/status/1993037630209192276
“For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet.” — seems like anthropic has finally listened!
With less token usage, cheaper pricing, and enhanced usage limits for Opus, Anthropic are taking the fight to Gemini and OpenAI Codex. Coding agent performance leads to better general work and personal task performance, so if Anthropic continue to execute well on ergonomics they have a chance to overcome their distribution disadvantages versus the other top players.
I've almost run out of Claude on the Web credits. If they announce that they're going to support Opus there, then I'm going to be sad :'(
Still mad at them because they decided not to take their users' privacy seriously. I'd be interested in how the new model behaves, but I just have a mental block and can't sign up again.
The SWE-bench results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but from Anthropic, I expect more.
I wonder what this means for UX designers like myself who would love to take a screen from Figma and turn it into code with just a single call to the MCP. I've found that Gemini 3 in Figma Make works very well at one-shotting a page when it actually works (there's a lot of issues with it actually working, sadly), so hopefully Opus 4.5 is even better.
Anecdotally, I’ve been using opus 4.5 today via the chat interface to review several large and complex interdependent documents, fillet bits out of them and build a report. It’s very very good at this, and much better than opus 4.1. I actually didn’t realise that I was using opus 4.5 until I saw this thread.
Does it follow directions? I’ve found Sonnet 4.5 to be useless for automated workflows because it refuses to follow directions. I hope they didn’t take the same RLHF approach they did with that model.
The real question I have, after seeing the usage rug being pulled, is what this costs and how usable it ACTUALLY is with a Claude Max 20x subscription. In practice, Opus has been basically unusable for anyone not paying enterprise prices. And the modification of "usage" quotas has made the platform fundamentally unstable; honestly, it left me personally feeling like I was cheated by Anthropic...
again the question of concern as codex user is usage
it's hard to get any meaningful use out of claude pro
after you ship a few features you are pretty much out of weekly usage
compared to what codex-5.1-max offers on a plan that is 5x cheaper
the 4~5% improvement is welcome but honestly i question whether it's possible to get meaningful usage out of it the way codex allows
for most use cases medium or 4.5 handles things well but anthropic seems to have way less usage limits than what openai is subsidizing
until they can match what i can get out of codex it won't be enough to win me back
edit: I upgraded to claude max! read the blog carefully and it seems like usage limits are raised for opus 4.5 as well as sonnet 4.5!
slightly better at react and spatial logic than gemini 3 pro, but slower and way more expensive.
Has there been any announcement of a new programming benchmark? SWE looks like it's close to saturation already. At this point for SWE it may be more interesting to start looking at which types of issues consistently fail/work between model families.
Up until today, the general advice was use Opus for deep research, use Haiku for everything else. Given the reduction in cost here, does that rule of thumb no longer apply?
Oh that's why there were only 2 usage bars.
Does the reduced price mean increased usage limits on Claude Code (with a Max subscription)?
What surprises me is that Opus 4.5 lost all the reasoning scores to Gemini and GPT. I thought that's the area where the model would shine the most.
It's really hard for me to take these benchmarks seriously at all, especially that first one where Sonnet 4.5 is better at software engineering than Opus 4.1.
It is emphatically not, it has never been, I have used both models extensively and I have never encountered a single situation where Sonnet did a better job than Opus. Any coding benchmark that has Sonnet above Opus is broken, or at the very least measuring things that are totally irrelevant to my usecases.
This in particular isn't my "oh, the teachers lie to you" moment that makes you distrust everything they say, but it really hammers the point home. I'm glad there's a cost drop, but at this point my assumption is that there's also going to be a quality drop, until I can prove otherwise in real-world testing.
They lowered the price because this is a massive land grab and is basically winner take all.
I love that Anthropic is focused on coding. I've found their models to be significantly better at producing code similar to what I would write, meaning it's easy to debug and grok.
Gemini does weird stuff and while Codex is good, I prefer Sonnet 4.5 and Claude code.
This is great. Sonnet 4.5 has degraded terribly.
I can get some useful stuff from a clean context in the web ui but the cli is just useless.
Opus is far superior.
Today Sonnet 4.5 suggested verifying the presence of a remote state file by creating an empty one locally and copying it to the remote backend. Da fuq? University-level programmer my a$$.
And it seems like it has degraded this last month.
I keep getting braindead suggestions and code that looks like it came from a random word generator.
I swear it was not that awful a couple of months ago.
The Opus cap has been an issue, so I'm happy to change, and I really hope the nerf rumours are just that: unfounded rumours, and that the degradation has a valid root cause.
But honestly sonnet 4.5 has started to act like a smoking pile of sh**t
Ok, but can it play Factorio?
I'm curious if others are finding that there's a comfort in staying within the Claude ecosystem because when it makes a mistake, we get used to spotting the pattern. I'm finding that when I try new models, their "stupid" moments are more surprising and infuriating.
Given this tech is new, the experience of how we relate to their mistakes is something I think a bit about.
Am I alone here, are others finding themselves more forgiving of "their preferred" model provider?
The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.
Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.
The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.