Hacker News

orbital-decay · yesterday at 9:25 PM · 2 replies

That's the thing with benchmarks: without evals and actual hands-on experience, they can give you false confidence. Claude now sounds almost clinical and can't shift between styles as easily. Claude 4+ uses many more constructions borrowed from English than Claude 3 did, especially in Slavic languages, where they sound unnatural.

Most modern models also eventually glitch out in longer texts, emitting a few garbage tokens in a random, totally unrelated language (Telugu, Georgian, Ukrainian), then continuing in the main language like nothing happened. It's rare, but it happens. Samplers don't help with this; you need a second run to spellcheck it. This wasn't a problem in older models; it's a widespread issue that roughly correlates with the introduction of reasoning.

Another new failure mode is self-correction in complicated texts that require reading comprehension: if the model hallucinates an incorrect fact and spots it, it immediately tries to justify or explain it, which is much more awkward than leaving it wrong. Those hallucinations are also more common now (maybe because the model learns to make the mistake together with the correction? I don't know).
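That "second run to spellcheck" pass can be approximated mechanically: glitch tokens in an unrelated language are usually in a different Unicode script, which the stdlib can detect. A minimal sketch (the function names, the coarse script list, and the name-prefix heuristic are all mine, not anything the models or any library provide):

```python
import unicodedata

def script_of(ch):
    # Coarse script bucket via the Unicode character name (stdlib only).
    # A real implementation would use the Script property (UAX #24).
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return None
    for script in ("CYRILLIC", "GEORGIAN", "TELUGU", "LATIN", "GREEK"):
        if name.startswith(script):
            return script
    return None

def foreign_script_spans(text, expected="LATIN"):
    # Flag characters whose script differs from the expected main language.
    return [(i, ch, s) for i, ch in enumerate(text)
            if (s := script_of(ch)) not in (None, expected)]

# Example: a Georgian glyph glitching into English text
print(foreign_script_spans("the cat saჭ on the mat"))
# → [(10, 'ჭ', 'GEORGIAN')]
```

Punctuation and digits fall through to `None` and are ignored, so only genuinely foreign-script characters get flagged for the cleanup pass.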


Replies

awongh · yesterday at 9:45 PM

Not disputing that this might be true, but it seems like something that should be capturable in a multilingual benchmark.

Maybe it's just something that people aren't bothered with?

Der_Einzige · today at 3:24 AM

Btw samplers do in fact help with this. Random tokens deep in your output context are due to accumulated sampling errors from using shit samplers like top_p and top_k with temperature.

Use a full distribution-aware sampler like p-less decoding, top-H, or top-n sigma, and this goes away.
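For concreteness, here is a minimal numpy sketch of the top-n sigma idea as I understand it (keep only tokens whose logit lies within n standard deviations of the maximum logit, then sample from what survives); the function names and the default n are illustrative, not from the paper:

```python
import numpy as np

def top_n_sigma_filter(logits, n=1.0):
    # Keep tokens with logit >= max - n * std(logits); mask the rest.
    # The long low-logit noise tail inflates std, so the threshold
    # naturally cuts the tail off before temperature is applied.
    logits = np.asarray(logits, dtype=float)
    threshold = logits.max() - n * logits.std()
    return np.where(logits >= threshold, logits, -np.inf)

def sample(logits, temperature=1.0, rng=None):
    # Softmax-sample from (possibly masked) logits.
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z[np.isfinite(z)].max()   # numerical stability
    p = np.exp(z)                     # exp(-inf) -> 0: masked tokens drop out
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Two plausible tokens plus a noise tail: the tail is masked entirely,
# so no amount of temperature can surface a garbage token from it.
logits = [8.0, 7.5, 0.1, 0.0, -0.2, -0.3]
print(top_n_sigma_filter(logits, n=1.0))
```

Contrast with top_p/top_k: those operate on the sorted probability mass, so with high temperature some tail probability routinely survives the cut, and over thousands of tokens those small errors accumulate into exactly the stray-token glitches described above.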

Yes, the paper for this will be up for review at NeurIPS this year.