Hacker News

Claude Sonnet 5

666 points by marinesebastian today at 5:59 PM | 348 comments

Comments

artursapek today at 8:41 PM

I run a proofreading benchmark that tests how well models can find and fix errors in English text. They get several passes in a simple agent loop. Sonnet 5 is definitely better than Sonnet 4.6, but inferior on both quality and cost to GLM 5.1, GLM 5.2, Gemini 3.1 Flash, and Gemini 3.1 Pro. https://revise.io/errata-bench

mellosty today at 6:27 PM

It does not pass the "I want to wash my car, should I drive or walk" test.

ai_fry_ur_brain today at 8:38 PM

Finally a model release where everyone is realising the scam. The world is healing (maybe).

smallerfish today at 6:20 PM

Ah that's why Opus has been so slow for the last couple of days.

prmph today at 7:38 PM

So many things to think about regarding these "benchmarks":

- Do the ever-increasing scores mean we will soon have models that approach 100%? And what would that even mean? That there is no more room for improvement?

- Would Anthropic (or any other model vendor, for that matter) ever release a newer model that scores lower? If not, does that mean they keep tweaking a new model they want to release until it shows an improvement over the prior model?

- Would it be more useful to move toward a comparative rather than absolute ranking?

joaohaas today at 7:38 PM

Important to note that the cost graphs are heavily distorted. The agentic search one, for example, is divided into 3 'columns': $0-$2, $2-$5 and $5-$10.

And yet, the $2-$5 section is the widest, even though it only contains a single point.

I can't even say if this is making the product look better or not, but it sure is weird. Maybe Claude just hallucinated those splits xD

tensegrist today at 6:01 PM

there was a vibecoded prediction market–style page that was put up yesterday (?) that got the date exactly right i think

PeterStuer today at 7:20 PM

Anyone else feel like Opus 4.8 got significantly dumber over the last 2 weeks?

Scroll_Swe today at 6:10 PM

I don't pay so I'm glad for the upgrade. I usually use Gemini, Mistral Le Chat (Vibe...) or Deepseek as they have way more generous free limits and I can basically spam forever.

docheinestages today at 6:23 PM

Is it just me or is there a huge difference between how much one can accomplish in a 5-hour window with GPT 5.5 on xhigh versus any Claude model?

_pdp_ today at 7:05 PM

Too expensive?

jchw today at 6:16 PM

American AI company status: We are now bragging about how bad our models are unironically.

Okay.

andrewchambers today at 7:40 PM

The whole fable fiasco really soured me on Anthropic. This just looks disappointing by comparison.

gverrilla today at 6:45 PM

Is this the default model for non-paying users? If so, that could be an interesting move in the competition for this segment.

ekjhgkejhgk today at 6:31 PM

In effective terms they're lowering prices.

Getchowned today at 6:51 PM

Fable soon please.

micromacrofoot today at 6:26 PM

So they repackaged Fable and added "don't scare the government" to the prompt

moomin today at 6:15 PM

I feel like this is a bit of a disappointment. Sonnet 4 was a clear step above Opus 3.x, while this is a lot muddier.

mesmertech today at 6:08 PM

OK, that's a one-month clock to the next Opus model at least, so that's a silver lining to a meh model.

stackedinserter today at 6:29 PM

"Our new model is proudly dumber now!"

varispeed today at 7:41 PM

What is the point if it is one Trump's brain fart away from being blocked?

lucynight today at 6:13 PM

AMAZING