Hacker News

wongarsu, today at 2:15 PM

Yes. Most benchmarks just measure how many answers are correct. The best way to optimize for that is to state something confidently in the hope that it's correct, which is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.
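To make the incentive concrete, here is a minimal sketch of the arithmetic, not taken from any particular benchmark. It assumes a hypothetical scoring rule where a correct answer earns 1 point, a wrong answer costs `wrong_penalty` points, and abstaining ("I don't know") earns 0. The function names and the penalty parameter are illustrative assumptions.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the answer is right with probability p_correct.

    Assumed scoring rule (hypothetical): correct = +1, wrong = -wrong_penalty.
    Abstaining is assumed to score 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only if doing so beats the abstain score of 0 in expectation."""
    return expected_score(p_correct, wrong_penalty) > 0.0


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        accuracy_only = should_answer(p, wrong_penalty=0.0)  # wrong answers cost nothing
        penalized = should_answer(p, wrong_penalty=1.0)      # wrong answers cost one point
        print(f"p={p}: accuracy-only -> {accuracy_only}, penalized -> {penalized}")
```

With `wrong_penalty=0.0`, which corresponds to counting only correct answers, guessing beats abstaining for any nonzero chance of being right, so a confident guess is always the better strategy. Under this assumed rule, a guess only pays off when `p_correct` exceeds `wrong_penalty / (1 + wrong_penalty)`.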


Replies

Imustaskforhelp, today at 2:50 PM

If this is the case, then the GLM 5.2 model seems better than GPT 5.5, or maybe even "Fable", depending on what you are trying to achieve.

The Fable model is being removed from Anthropic because of security concerns raised by the US government (or, well, also partly because of the personal vendetta between the US govt and Anthropic).