Hacker News

ZeroGravitas · yesterday at 8:24 PM

In human multiple-choice tests, negative marking is sometimes used to discourage guessing. It feels like exploits should cancel out several correct solutions. (For reference, a sketch of classic negative marking follows.)
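In classic negative marking ("formula scoring"), each wrong answer on a k-choice question costs 1/(k-1) points, so blind guessing has an expected score of zero. A minimal sketch of that idea (function and variable names are illustrative):

```python
def formula_score(correct: int, wrong: int, num_choices: int) -> float:
    """Negative marking: each wrong answer costs 1/(k-1) points,
    so random guessing among k choices has zero expected value."""
    return correct - wrong / (num_choices - 1)

# A pure guesser on 100 four-choice questions expects ~25 right, ~75 wrong:
print(formula_score(25, 75, 4))  # 0.0 on average
```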


Replies

lambda · yesterday at 9:29 PM

Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" and giving a made-up answer, and a made-up answer improves the score some of the time. So by chasing higher numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.

The Artificial Analysis Omniscience benchmark does penalize guessing, so it helps you determine which LLMs are likely to guess rather than tell you they don't know. Only a few of the frontier models score higher than 0 on it, where 0 means the model is as likely to return a hallucination as a correct answer on factual questions.
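A rough sketch of what a guessing-penalized metric like that looks like (this is the general shape, not the exact Artificial Analysis formula; all names are illustrative):

```python
def knowledge_score(correct: int, hallucinated: int, abstained: int) -> float:
    """Guessing-penalized accuracy: correct answers score +1,
    hallucinations -1, and honest "I don't know" responses 0.
    A total of 0 means correct answers and hallucinations
    occur at the same rate."""
    total = correct + hallucinated + abstained
    return (correct - hallucinated) / total

# A model that guesses confidently on everything it doesn't know
# can land below zero despite decent raw accuracy:
print(knowledge_score(correct=40, hallucinated=45, abstained=15))  # -0.05
```

Under a scheme like this, a model that abstains when unsure always beats one that guesses and is wrong, which is exactly the incentive plain accuracy benchmarks fail to provide.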