I’m amazed by how much Gemini 3 flash hallucinates; it performs poorly in that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant; GPT-5.1 (high), opus 4.5 and 4.5 haiku are.
Can someone explain how Gemini 3 pro/flash then do so well then in the overall Omniscience: Knowledge and Hallucination Benchmark?
I’m amazed by how much Gemini 3 flash hallucinates; it performs poorly in that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant; GPT-5.1 (high), opus 4.5 and 4.5 haiku are.
Can someone explain how Gemini 3 pro/flash then do so well then in the overall Omniscience: Knowledge and Hallucination Benchmark?