wongarsu • today at 2:15 PM • 1 reply

Yes. Most benchmarks just measure how many answers are correct. The best way to optimize for that is to state something confidently in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
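
To make the incentive concrete, here's a rough sketch. The scoring rule is a toy assumption of mine, not any real benchmark's rules: a correct answer scores 1, abstaining scores 0, and a wrong answer costs a hypothetical `wrong_penalty`.

```python
def expected_score(p_correct, wrong_penalty=0.0):
    """Expected score for answering when the answer is right with probability p_correct.

    Assumed toy scoring: correct = +1, wrong = -wrong_penalty, abstain = 0.
    wrong_penalty=0.0 corresponds to plain "count the correct answers" scoring.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct, wrong_penalty=0.0):
    """Answer only when the expected score of guessing beats abstaining (score 0)."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Compare accuracy-only scoring with a scoring rule that penalizes wrong answers.
for p in (0.1, 0.5, 0.9):
    print(p, should_answer(p), should_answer(p, wrong_penalty=1.0))
```

Under accuracy-only scoring, answering pays off at any nonzero chance of being right, so a 10% guess still beats "I don't know". With a penalty of 1 for wrong answers, answering only pays off above 50% confidence.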

Replies

Imustaskforhelp • today at 2:50 PM

If this is the case, then the GLM 5.2 model seems better than GPT 5.5, or maybe even "Fable", depending on what you are trying to achieve.

The Fable model is being removed from Anthropic because of security concerns from the US government (or, well, also partly because of the personal vendetta between the US government and Anthropic).
