Hacker News

ZeroGravitas · yesterday at 8:24 PM

In human multiple-choice tests, negative marking is sometimes used to discourage guessing. It feels like exploits should cancel out several correct solutions. (For reference, a sketch of classic negative marking follows.)
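In classic negative marking ("formula scoring"), each wrong answer on a k-choice question costs 1/(k-1) points, so blind guessing has an expected score of zero. A minimal sketch of that idea (function and variable names are illustrative):

```python
def formula_score(correct: int, wrong: int, num_choices: int) -> float:
    """Negative marking: each wrong answer costs 1/(k-1) points,
    so random guessing among k choices has zero expected value."""
    return correct - wrong / (num_choices - 1)

# A pure guesser on 100 four-choice questions expects ~25 right, ~75 wrong:
print(formula_score(25, 75, 4))  # 0.0 on average
```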


Replies

lambda · yesterday at 9:29 PM

Unfortunately, very few LLM benchmarks do this. LLMs get such high scores on many benchmarks because there's no difference between answering "I don't know" and giving a made-up answer, and a made-up answer improves the score some of the time. So by chasing higher numbers on these kinds of benchmarks, the labs are prioritizing guessing over accuracy.

The Artificial Analysis Omniscience benchmark does penalize guessing, so it helps you determine which LLMs are likely to guess rather than tell you they don't know. Only a few of the frontier models score higher than 0 on it, where 0 means the model is as likely to return a hallucination as a correct answer on factual questions.
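A rough sketch of what a guessing-penalized metric like that looks like (this is the general shape, not the exact Artificial Analysis formula; all names are illustrative):

```python
def knowledge_score(correct: int, hallucinated: int, abstained: int) -> float:
    """Guessing-penalized accuracy: correct answers score +1,
    hallucinations -1, and honest "I don't know" responses 0.
    A total of 0 means correct answers and hallucinations
    occur at the same rate."""
    total = correct + hallucinated + abstained
    return (correct - hallucinated) / total

# A model that guesses confidently on everything it doesn't know
# can land below zero despite decent raw accuracy:
print(knowledge_score(correct=40, hallucinated=45, abstained=15))  # -0.05
```

Under a scheme like this, a model that abstains when unsure always beats one that guesses and is wrong, which is exactly the incentive plain accuracy benchmarks fail to provide.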