Claude Sonnet 5

666 points • by marinesebastian • today at 5:59 PM • 348 comments

Comments

I run a proofreading benchmark that tests how well models can find and fix errors in English text. They get several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench

mellosty • today at 6:27 PM

It does not pass the "I want to wash my car, should I drive or walk" test.

ai_fry_ur_brain • today at 8:38 PM

Finally a model release where everyone is realising the scam. The world is healing (maybe).

smallerfish • today at 6:20 PM

Ah that's why Opus has been so slow for the last couple of days.

prmph • today at 7:38 PM

So many things to think about regarding these "benchmarks":

- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?

- Would Anthropic (or any other model vendor for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?

- Would it be more useful to move toward a comparative rather than absolute ranking?

joaohaas • today at 7:38 PM

Important to note that the cost graphs are heavily distorted. The agentic search one, for example, is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.

And yet, the $2-$5 section is the widest, even though it only contains a single point.

I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD
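To put numbers on that, here is a minimal sketch that assumes the axis was meant to be linear and uses only the three bin edges quoted above. The chart itself is not reproduced here, and the function name is just an illustration.

```python
# Bin edges as quoted in the comment above; nothing else about the chart is assumed.
BINS = [(0, 2), (2, 5), (5, 10)]


def linear_width_shares(bins):
    """Return each bin's share of the axis if it were drawn to linear scale."""
    total = bins[-1][1] - bins[0][0]
    return [(hi - lo) / total for lo, hi in bins]


shares = linear_width_shares(BINS)
for (lo, hi), share in zip(BINS, shares):
    print(f"${lo}-${hi}: {share:.0%} of the axis")
```

On a linear axis the $5-$10 column would take half the width and $2-$5 under a third, so $2-$5 being the widest means the axis is not drawn to scale.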

tensegrist • today at 6:01 PM

there was a vibecoded prediction market–style page that was put up yesterday (?) that got the date exactly right i think

PeterStuer • today at 7:20 PM

Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?

Scroll_Swe • today at 6:10 PM

I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.

docheinestages • today at 6:23 PM

Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?

_pdp_ • today at 7:05 PM

Too expensive?

jchw • today at 6:16 PM

American AI company status: We are now bragging about how bad our models are unironically.

Okay.

andrewchambers • today at 7:40 PM

The whole Fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.

gverrilla • today at 6:45 PM

Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.

ekjhgkejhgk • today at 6:31 PM

In effective terms they're lowering prices.

Getchowned • today at 6:51 PM

Fable soon please.

micromacrofoot • today at 6:26 PM

So they repackaged Fable and added "don't scare the government" to the prompt

moomin • today at 6:15 PM

I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.

mesmertech • today at 6:08 PM

OK, that's a one-month clock to the next Opus model at least, so that's a silver lining to a meh model.

stackedinserter • today at 6:29 PM

"Our new model is proudly dumber now!"

varispeed • today at 7:41 PM

What is the point if it is one Trump brain fart away from being blocked?

lucynight • today at 6:13 PM

AMAZING
