wongarsu • today at 11:46 AM • 2 replies • view on HN

It does really well on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable. I really like that benchmark because it's one of the few that allows LLMs to elect not to answer if they are unsure, and punishes them for trying to bullshit their way through.

Replies

SilverServer • today at 2:00 PM

It took me a while to figure out how to interpret the benchmark correctly, because the overview page says "AA-Omniscience Non-Hallucination Rate," but the benchmark page https://artificialanalysis.ai/evaluations/omniscience#aa-omn... says "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.
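To make the "non reverses the scores" point concrete, here is a toy sketch in Python. It assumes the hallucination rate is the share of wrong answers among all questions the model did not get right, and that the non-hallucination rate is simply one minus that. I have not checked this against the official methodology, and the function name, grade labels, and both example models are made up for illustration.

```python
from collections import Counter

def hallucination_rates(grades):
    """Return (hallucination_rate, non_hallucination_rate) for a list of grades.

    ASSUMPTION (not the official formula): hallucination rate is
    incorrect / (all responses that were not correct), and the
    non-hallucination rate is its complement.
    """
    counts = Counter(grades)
    non_correct = len(grades) - counts["correct"]
    if non_correct == 0:
        # Nothing was missed, so there was nothing to hallucinate on.
        return 0.0, 1.0
    hallucination = counts["incorrect"] / non_correct
    return hallucination, 1.0 - hallucination

# Hypothetical model that abstains when unsure.
cautious = ["correct"] * 5 + ["abstain"] * 4 + ["incorrect"] * 1
# Hypothetical model that always answers, right or wrong.
bluffer = ["correct"] * 6 + ["incorrect"] * 4

for name, grades in [("cautious", cautious), ("bluffer", bluffer)]:
    h, nh = hallucination_rates(grades)
    print(f"{name}: hallucination={h:.2f} (lower is better), "
          f"non-hallucination={nh:.2f} (higher is better)")
```

Under these assumptions the bluffer gets more answers right but still ranks worse on both columns, and the two columns always order the models the same way, just with the direction flipped.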

andai • today at 1:35 PM

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?
