It does not pass the "I want to wash my car, should I drive or walk" test.
Finally a model release where everyone is realising the scam. The world is healing (maybe).
Ah that's why Opus has been so slow for the last couple of days.
So many things to think about regarding these "benchmarks":
- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?
- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?
- Would it be more useful to move toward a comparative rather than absolute ranking?
Important to note that the cost graphs are heavily distorted. The agentic search one, for example, is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.
And yet, the $2-$5 section is the widest, even though it only contains a single point.
I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD
there was a vibecoded prediction market–style page that was put up yesterday (?) that got the date exactly right i think
Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?
I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.
Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?
Too expensive?
American AI company status: We are now bragging about how bad our models are unironically.
Okay.
The whole Fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.
Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.
In effective terms they're lowering prices.
Fable soon please.
So they repackaged Fable and added "don't scare the government" to the prompt
I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.
Ok, that's a one-month clock to the next Opus model at least, so that's a silver lining to a meh model.
What is the point if it is one Trump brain fart away from being blocked?
AMAZING
I run a proofreading benchmark that tests how well models can find and fix errors in English text. They get several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench