let me guess, "this is our best model yet"
They must have been A/B testing this with 4.7 lately, I noticed it changed from its normal mode in a way that matches a lot the just released 4.8
when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays
It refused to work for me. Literally said, you can google it. AGI achieved it seems
I believe analogy with smartphone will be best for this case.
In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.
I find it surprising that the gap between tool usage and non-tool usage in HLE is relatively small (~10%) but the absolute numbers continue to go up
https://marginlab.ai/trackers/claude-code/
Is it a coincidence that 4.7 was seemingly quantized over past 7 days?
The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.
Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.
Same price for regular and cheaper fast mode. Happy for these incremental improvements.
Seems like from now on the updates will be a minor upgrade from previous models.
I, for lack of a better word, dislike anyone who anthropomorphizes AI.
I used to think it was a big deal when a HN post had more than 500 comments.
Now it’s every day. Like billion dollar evaluations.
Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).
Anyone else experiencing tool call failures? Switch back to 4.7, same prompt, same everything it works with no problems.
I can't get excited about these benchmarks they're leading with. I've looked at the Terminal-Bench questions and I just think they're irrelevant. And SWE-Bench has serious flaws, even the big boys say so: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin
and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.
And all the tests are run with the same harness. Terminus 2.
Maybe it correlates with model intelligence but it doesn't speak to me.
I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.
Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?
Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.
Just show me the pelican, ah wait we are past pelicans. Can we get something like that ever again?
> Dynamic workflows. This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session
Are they going to retire the existing beta "teams" feature for agents to make room for this?
4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.
Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now
Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to
> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%
> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.
Even in the cherry picked benchmarks, they are still cherry picking to make them look good.
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.
They're only subsidizing more and more it seems
> One of the most prominent improvements in Opus 4.8 is its honesty.
I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.
In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.
The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)
Can I disable adaptive thinking? If not, I'm gonna keep using 4.6 as my default.
Had a feeling this was coming as in the past week 4.7 started to get dumb.
I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.
Subscription still doesn't work with pi, so totally useless..
It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.
Really appreciate the ability to select effort level again.
Hot danm, cant wait to reach my token limit with the new LLM
Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)
> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.
Excited to see what this model looks like.
Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.
All I need for Christmas is a Claude that doesn't spit out so many em dashes.
I don't know why the world is so happy about this when we should actually say stop.
They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...
The new "mid-conversation system messages" think is particularly interesting:
> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.
Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.
This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...
It feels noticeably sharper than Opus 4.7
>> As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview
Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.
The smarter the model the better querybear gets. I'm happy with that.
I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.
"model": "claude-opus-4-6[1M]"From the release it seems we will also get Mythos pretty soon.
> One of the most prominent improvements in Opus 4.8 is its honesty
Anthropic talks about their own models as if they're discovering new species in the wild...