Here’s the problem: the distribution of query difficulty / task complexity is probably heavily right-skewed, which drives up the average cost dramatically. The logical thing for Anthropic to do, to keep costs under control, is to throttle high-cost queries. But Claude can only approximate the true token cost of a query before executing it, so anything near the top percentile will need to get throttled as well.
By definition this means you’re going to get subpar results for difficult queries. Anything too complicated will get a lightweight-model response to save capacity, or an outright refusal, which is also becoming more common.
New models are meaningless in this context, because by definition the most impressive examples from the marketing material will not be consistently reproducible by users. The more users try to get these fantastically complex outputs, the more those outputs get throttled.
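To put rough numbers on that skew argument, here is a quick simulation (purely hypothetical parameters, assuming a lognormal cost distribution) showing why capping the top percentile is such an attractive lever for a provider:

```python
import numpy as np

# Hypothetical: per-query token costs drawn from a heavy-tailed lognormal.
rng = np.random.default_rng(0)
costs = rng.lognormal(mean=8.0, sigma=2.0, size=1_000_000)

p99 = np.percentile(costs, 99)
throttled = np.minimum(costs, p99)  # cap the top 1% of queries at the p99 cost

print(f"mean cost, untouched:   {costs.mean():,.0f} tokens")
print(f"mean cost, capped @p99: {throttled.mean():,.0f} tokens")
print(f"share of total spend in top 1%: {costs[costs > p99].sum() / costs.sum():.0%}")
```

With a tail this heavy, the top 1% of queries carries a double-digit share of total token spend, so capping just that slice moves the average a lot. Whatever the real distribution looks like, that is the incentive structure described above.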
Reminder that 4.7 may seem like a huge upgrade over 4.6 because they nerfed the F out of 4.6 ahead of this launch, precisely so that 4.7 would look like a remarkable improvement...
All fine, but where is the pelican on a bicycle?
> First, Opus 4.7 uses an updated tokenizer that improves how the model processes text
Wow, can I see it and run it locally, please? Making API calls just to check token counts is absurd.
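For reference, the only way to get an exact count today is indeed a round trip; a minimal sketch with the Anthropic Python SDK (the model id below is a hypothetical placeholder):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The tokenizer itself is not published, so an exact count requires a network call.
result = client.messages.count_tokens(
    model="claude-opus-4-7",  # hypothetical model id, for illustration only
    messages=[{"role": "user", "content": "How many tokens am I?"}],
)
print(result.input_tokens)
```

Contrast with tiktoken on the OpenAI side, which runs entirely offline; presumably that's what the parent is asking for here.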
Excited to start using this!
Introducing a new upgraded slot machine named "Claude Opus" in the Anthropic casino.
You are in for a treat this time: it is the same price as the last one (if you are using the API). [0]
But it is slightly less capable than the other slot machine, the one named 'Mythos' that everyone actually wants to play. [1]
"Error: claude-opus-4-6[1m] is temporarily unavailable".
Sigh, here we go again. Model release day is always the worst day of the quarter for me: I get a lovely anxiety attack and have to avoid all parts of the internet for a few days :/
amazing speed...
Even Sonnet has degraded for me recently, to the point of feeling like ChatGPT 3.5 back in the day. It took ~5 hours to get a Playwright e2e test fixed that was waiting on a wrong CSS selector. Literally, dumb as fuck: it kept burning more and more thinking tokens circling around nonsense instead of making the one-line change a junior dev would spot instantly. It had been better than Opus for the last week or so, and did roughly comparable work for the last two weeks, but it all went increasingly worse.

Too used to vibing now to do it by hand (yeah, I know), so I kept watching, and meanwhile discovered that Codex fleshed out a nontrivial app with correct financial data flows in the same time without any fuss. I really don't get why Anthropic is dropping their edge so hard recently. My guess is they're aiming for hype building up to the IPO, not disappointment crashes from their power-user base.
It seems like we're hitting a solid plateau of LLM performance with only slight changes each generation. The jumps between versions are getting smaller. When will the AI bubble pop?
I wonder if this one will be able to stop putting my fucking Python imports inline, LIKE I'VE TOLD IT A THOUSAND TIMES.
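If a thousand prompts haven't worked, a hard gate in CI might; here is a minimal hand-rolled sketch (hypothetical script, not part of any model's tooling) that flags import statements nested inside function bodies. Pylint's import-outside-toplevel (C0415) check does the same thing if you'd rather not maintain it yourself.

```python
import ast
import sys


def inline_import_lines(path: str) -> list[int]:
    """Return line numbers of import statements nested inside function bodies."""
    tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
    offenders: set[int] = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Walk each function body and record any import found inside it.
            for inner in ast.walk(node):
                if isinstance(inner, (ast.Import, ast.ImportFrom)):
                    offenders.add(inner.lineno)
    return sorted(offenders)


if __name__ == "__main__":
    failed = False
    for path in sys.argv[1:]:
        for lineno in inline_import_lines(path):
            print(f"{path}:{lineno}: import inside a function body")
            failed = True
    sys.exit(1 if failed else 0)
```

Run it over the files the model touched and fail the commit on a nonzero exit; that's a more reliable fix than repeating the instruction in the prompt.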
> indeed, during its training we experimented with efforts to differentially reduce these capabilities
Can't wait for the Chinese models to make arrogant Silicon Valley irrelevant.
We all know this is actually Mythos, just called Opus 4.7 to avoid disappointment, right?
TL;DR: iPhone is getting better every year.
The surprise: agentic search is somehow significantly weaker. Hmm...
New model - that explains why, for the past week or two, I had this feeling of 4.6 being much less "intelligent". I hope this is only some kind of paranoia and we (and the investors) are not being played by the big corp. /s
The model card confirms the chain-of-thought supervision error from Mythos was present during Opus 4.7 training too, affecting 7.8% of episodes. That's not a one-time bug that got patched. It's a training pipeline issue that persisted across model generations. The long-context regression from 91.9% to 59.2% is also worth noting — they traded retrieval accuracy for coding benchmarks, which is a reasonable engineering choice, but the framing buries it.
> In Claude Code, we’ve raised the default effort level to xhigh for all plans.
Does it also mean running out of credits faster?
Codex release coming today: https://x.com/thsottiaux/status/2044803491332526287