As the author of the now (in)famous report in issue https://github.com/anthropics/claude-code/issues/42796 (sorry stella :) all I can say is... sigh. Reading through the changelog felt as if they codified every bad experiment they ran that hurt Opus 4.6. It makes it clear that the degradation was not accidental.
I'm still sad. I had a transformative 6 months with Opus and do not regret it, but I'm also glad that I didn't let hope keep me stuck for another few weeks: had I been waiting for a correction I'd be crushed by this.
Hypothesis: Mythos maintains the behavior of what Opus used to be, with a few tricks, only now it is restricted to the hands of a few whom Anthropic deems worthy. Opus is now the consumer line. I'll still use Opus for some code reviews, but it does not seem like it'll ever go back to collaborator status, by design. :(
I’ve been using Opus 4.6 extensively inside Claude Code via AWS Bedrock with max effort for a few months now (since release). I’ve found a good “personal harness” and a way of working with it that lets me easily complete self-contained tasks in my Java codebase.
Now idk if it’s just me or if something else changed, but, in the last 4/5 days, the quality of the output of Opus 4.6 with max effort has been ON ANOTHER LEVEL. ABSOLUTELY AMAZING! It seems to reason deeper, verifies the work with tests more often, and I even think that it compacted the conversations more effectively and more often. Somehow even the quality of the English “text” in the output felt definitely superior. More crisp, using diagrams and analogies to explain things in a way that completely blew me away. I can’t explain it but this was absolutely real for me.
I’d say that I can measure it quite accurately because I’ve kept my harness and scope of tasks and way of prompting exactly the same, so something TRULY shifted.
I wish I could get some empirical evidence of this from others or a confirmation from Boris…. But ISTG these last few days felt absolutely incredible.
If Claude AI is so good at coding, why can't Anthropic use it to improve Claude's uptime and fix the constant token quota issues?
If the model is based on a new tokenizer, that means that it's very likely a completely new base model. Changing the tokenizer is changing the whole foundation a model is built on. It'd be more straightforward to add reasoning to a model architecture compared to swapping the tokenizer to a new one.
Usually a ground up rebuild is related to a bigger announcement. So, it's weird that they'd be naming it 4.7.
Swapping out the tokenizer is a massive change. Not an incremental one.
I wonder why computer use has taken a back seat. Seemed like it was a hot topic in 2024, but then sort of went obscure after CLI agents fully took over.
It would be interesting to see a company try to train a computer-use-specific model, with an actually meaningful amount of compute directed at that. Seems like there have just been experiments built upon models trained for completely different stuff, instead of any of the companies that put out SotA models taking a real shot at it.
The adaptive thinking behavior change is a real problem if you're running it in production pipelines. We use claude -p in an agentic loop and the default-off reasoning summary broke a couple of integrations silently — no error, just missing data downstream. The "display": "summarized" flag isn't well surfaced in the migration notes. Would have been nice to have a deprecation warning rather than a behavior change on the same model version.
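One way to turn that kind of silent gap into a loud failure is to assert on the fields your pipeline depends on before passing the JSON downstream. This is only a rough sketch: `require_field` is a helper name I made up, and `.thinking_summary` in the usage note is a placeholder, not a documented field name, so substitute whatever path your integration actually reads.

```bash
#!/bin/bash
# require_field JSON_STRING JQ_PATH
# Hypothetical helper: returns 0 when JQ_PATH resolves to a usable value,
# and returns 1 (with a message on stderr) when the value is missing,
# null, false, an empty string, or when the input is not valid JSON.
require_field() {
  local json=$1 path=$2
  # `// empty` drops null/false; select() drops empty strings.
  # With -e, jq exits non-zero when nothing usable is produced.
  if ! printf '%s' "$json" | jq -e "$path // empty | select(. != \"\")" >/dev/null 2>&1; then
    echo "missing expected field: $path" >&2
    return 1
  fi
}
```

In a loop you would call something like `require_field "$response" '.thinking_summary' || exit 1` right after each model call, so a changed default stops the run instead of producing empty data further down.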
I'd recommend anyone to ask Claude to show used context and thinking effort on its status line, something like:
```
#!/bin/bash
input=$(cat)
DIR=$(echo "$input" | jq -r '.workspace.current_dir // empty')
PCT=$(echo "$input" | jq -r '.context_window.used_percentage // 0' | cut -d. -f1)
EFFORT=$(jq -r '.effortLevel // "default"' ~/.claude/settings.json 2>/dev/null)
echo "${DIR/#$HOME/~} | ${PCT}% | ${EFFORT}"
```
Because the TUI is not consistent when showing this, and sometimes they ship updates that change the default.
Using it to build https://rustic-playground.app. Rust + Claude turned out to be a surprisingly good pairing — the compiler catches a whole class of AI slip-ups before they ever run. So far so good!
I've noticed it getting dumber in certain situations. I can't point to it directly as of now, but it seems like it's hallucinating a bit more... and ditto on the adaptive thinking being confusing.
It's interesting to see Opus 4.7 follow so soon after the announcement of Mythos, especially given that Anthropic are apparently capacity constrained.
Capacity is shared between model training (pre & post) and inference, so it's hard to see Anthropic deciding that it made sense, while capacity constrained, to train two frontier models at the same time...
I'm guessing that this means that Mythos is not a whole new model separate from Opus 4.6 and 4.7, but is rather based on one of these with additional RL post-training for hacking (security vulnerability exploitation).
The alternative would be that perhaps Mythos is based on an early snapshot of their next major base model, and then presumably that Opus 4.7 is just Opus 4.6 with some additional post-training (as may be the case anyway).
Here you go folks:
https://www.svgviewer.dev/s/odDIA7FR
"create a svg of a pelican riding on a bicycle" - Opus 4.7 (adaptive thinking)
> Opus 4.7 is a direct upgrade to Opus 4.6, but two changes are worth planning for because they affect token usage. First, Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.
This is concerning & tone-deaf especially given their recent change to move Enterprise customers from $xxx/user/month plans to the $20/mo + incremental usage.
IMO the pursuit of ultraintelligence is going to hurt Anthropic, and a Sonnet 5 release that could hit near-Opus 4.6 level intelligence at a lower cost would be received much more favorably. They were already getting extreme push-back on the CC token counting and billing changes made over the past quarter.
Is Codex the new go-to? Opus stopped being useful about 45-60 days ago.
I've always seen people complaining about models getting dumber just before a new one drops and always thought this was confirmation bias. But today, several hours before the 4.7 release, Opus 4.6 was acting like it was Sonnet 2 or something from that era of models.
It didn't think at all, it was very verbose, extremely fast, and it was just... dumb.
So now I believe everyone who says models do get nerfed without any notification for whatever reasons Anthropic considers just.
So my question is: what is the actual reason Anthropic lobotomizes the model when the new one is about to be dropped?
I've taken a two-week hiatus from my personal projects, so I haven't experienced any of the issues that have been so widely reported recently with CC. I am eager to get back and see if I experience these same issues.
WTF. `Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. `
Seriously? You're degrading Opus 4.7 Cybersecurity performance on purpose. Absolute shit.
Do we have any performance benchmarks by token length? Now that the context size is 1M, I would want to know whether I can exhaust all of that or should clear earlier.
What's the point of baking the best and most impressive models in the world and then serving them with degraded quality a month after release, so that their intelligence is never fully utilised??
Honestly I've been doing a lot of image-related work recently and the biggest thing here for me is the 3x higher resolution images which can be submitted. This is huge for anyone working with graphs, scientific photographs, etc. The accuracy on a simple automated photograph processing pipeline I recently implemented with Opus 4.6 was about 40% which I was surprised at (simple OCR and recognition of basic features). It'll be interesting to see if 4.7 does much better.
I wonder if general purpose multimodal LLMs are beginning to eat the lunch of specific computer vision models - they are certainly easier to use.
How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems? Or as 11% fewer hallucinations?
I'm an Opus fanboy, but this is literally the worst coding model I have used in 6 months. It's completely unusable and borderline dangerous. It appears to think less than Haiku, will take any sort of absurd shortcut to achieve its goal, and refuses to do any reasoning. I was back on 4.6 within 2 hours.
Did Anthropic just give up their entire momentum on this garbage in an effort to increase profitability?
Been on 10-15 hours a day of sessions since January 31st. The last few days were horrendous. Thinking about dropping 20x.
The most important question is: does it perform better than 4.6 in real world tasks? What's your experience?
> Opus 4.7 introduces a new xhigh (“extra high”) effort level
I hope we standardize on what effort levels mean soon. Right now it has big Spinal Tap "this goes to 11" energy.
Interesting to see the benchmark numbers, though at this point I find these incremental-seeming updates hard to translate into capability increases for me beyond just "it might be somewhat better".
Maybe I've skimmed too quickly and missed it, but does calling it 4.7 instead of 5 imply that it's the same as 4.6, just trained with further refined data/fine tuned to adapt the 4.6 weights to the new tokenizer etc?
Install the latest claude code to use opus 4.7:
`claude install latest`
as every AI provider is pushing news today, just wanted to say that apfel is v1.0.4 stable today https://github.com/Arthur-Ficial/apfel
Will they actually give you enough usage? The biggest complaint is that Codex offers way more weekly usage. Also, this means the GPT 5.5 release is imminent (I suspect that's what Elephant is on OR).
I am waiting for the 2x usage window to close to try it out today.
If they are charging 2x usage during the most important part of the day, doesn't this give OpenAI a slight advantage as people might naturally use Codex during this period?
> We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.
Fucking hell.
Opus was my go-to for reverse engineering and cybersecurity uses, because, unlike OpenAI's ChatGPT, Anthropic's Opus didn't care about being asked to RE things or poke at vulns.
It would, however, shit a brick and block requests every time something remotely medical/biological showed up.
If their new "cybersecurity filter" is anywhere near as bad? Opus is dead for cybersec.
The benchmarks of Opus 4.6 they compare to MUST be retaken the day of the new model release. If it was nerfed we need to know how much.
I get a little sad with every new Claude release. Sonnet 4.5 is my favorite and each new model means it's one step closer to being retired. Nothing else replaces it for me
4.6 vastly outperforms 4.7 in my not so typical application - generating explanations of phrases and words for Chinese learners (simplifying). Robust complex long prompt tested on many different models. That's interesting.
Opus 4.7 came even quicker than I expected. It's like they are releasing a new Opus to distract us from Mythos that we all really want.
Just before the end is this one-liner:
> the same input can map to more tokens—roughly 1.0–1.35× depending on the content type
Does this mean that we get a 35% price increase for a 5% efficiency gain? I'm not sure that's worth it.
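To put numbers on the quoted range, here is a rough back-of-the-envelope sketch. It assumes the per-token price stays the same, which the quote itself does not say, and it only covers the input side. `token_range` is a throwaway helper name, not anything from Anthropic's tooling.

```bash
#!/bin/bash
# token_range OLD_TOKENS
# Prints the low and high token estimates for the same input under the
# quoted 1.0-1.35x multiplier. Integer math only; the high end rounds up.
token_range() {
  local old=$1
  local low=$old                             # 1.0x: no change
  local high=$(( (old * 135 + 99) / 100 ))   # 1.35x, rounded up
  echo "$low $high"
}

token_range 1000000   # prints: 1000000 1350000
```

So at an unchanged price, 1M tokens of input on the old tokenizer would be billed as somewhere between 1M and 1.35M tokens, depending on content type. The 35% figure is the worst case of the range, not the typical case.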
Interesting that the MCP-Atlas score for 4.6 jumped to 75.8% compared to 59.5% https://www.anthropic.com/news/claude-opus-4-6
There are other small single-digit differences, but I doubt that the benchmark is that unreliable...?
With the new tokenizer did they A/B test this one?
I'm curious if that might be responsible for some of the regressions in the last month. I've been getting feedback requests on almost every session lately, but wasn't sure if that was because of the large amount of negative feedback online.
This new one seems even pushier about shoving me toward the shortest-path solution.
7 trivial prompts, and I'm at 100% of my limit, using Sonnet, not Opus, this morning. Basically everyone at our company is reporting the same usage pattern. The support agent refuses to connect me to a human and terminated the conversation. I can't even get any other support, because when I click "get help" (in Claude Desktop) it just takes me back to the agent and that conversation, where Fin refuses to respond any more.
And then on my personal account I had $150 in credits yesterday. This morning it is at $100, and no, I didn't use my personal account, just $50 gone.
Commenting here because this appears to be the only place that Anthropic responds. Sorry to the bored readers, but this is just terrible service.
What a joke Opus 4.7 at max is.
I gave it an agentic software project to critically review.
It claimed gemini-3.1-pro-preview is the wrong model name and that the current one is 2.5. I said that was an unverified claim.
It offered to create a memory. I said it should have a better procedure, to avoid poisoning the process with unverified claims, since memories will most likely be ignored by it.
It agreed. It said it doesn't have another procedure, and it then discovered three more poisonous items in the critical review.
I said that this is a fabrication defect, it should not have been in production at all as a model.
It agreed. It said it can help, but I would need to verify its work. I said it's sticking me with both the bill and the audit.
We amicably parted ways.
I would have accepted a caveman-style vocabulary but not a lobotomized model.
I'm looking forward to LobotoClaw. Not really.
Based on my last few attempts in Claude Code to address a Docker build issue, this feels like a downgrade.
How powerful will Opus become before they decide to not release it publicly like Mythos?
Looks completely broken on AWS Bedrock
"errorCode": "InternalServerException", "errorMessage": "The system encountered an unexpected error during processing. Try your request again.",
Am I going to have to make it rewrite all the stuff 4.6 did?
If Opus 4.7 or Mythos are so good, how come Claude has some of the worst uptime among online services?
Claude Code hasn't updated yet it seems, but I was able to test it using `claude --model claude-opus-4-7`
Or `/model claude-opus-4-7` from an existing session
edit: `/model claude-opus-4-7[1m]` to select the 1m context window version
Anthropic shouldn't have released it. The gains are marginal at best. This release feels more like Opus 4.6 with better agentic capabilities. Mythos is what I expected Opus 4.7 to be. Are users gonna be charged more with this release, for such marginal gains? It could set a bad precedent.