This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, and they only have so much room for "information storage" in them, and there are a lot of other things they need to fit into their numbers.
This test is of only marginal utility in the real world compared to testing an AI with access to the web. While I wouldn't expect an AI with access to the web to arrive at Platonic Truth any more than I would expect a human with the same access to, it would probably get a lot closer to humanlike performance.
I recall how, about a year ago, we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply querying AIs directly or turning them loose on the web. The former is what this is testing, and it is fairly transparently stupid, just by the information-theoretic argument that the AIs simply can't contain the answers to every query; they're just not large enough (and really can't be, practically). I've had good results with the latter when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find is often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. The result isn't perfect, but I wouldn't expect a human to do better in a casual overview.
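To put rough numbers on the information-theoretic point, here is a back-of-envelope sketch. Every figure in it is an illustrative assumption rather than a measurement: the 70B parameter count, the 16 bits per parameter, and the 100 TB corpus size are all hypothetical. It only computes a loose upper bound on raw storage, which is far more than a model can actually use for facts.

```python
def raw_capacity_bytes(num_params: int, bits_per_param: int) -> float:
    """Upper bound on raw storage: every bit of every weight counted as usable."""
    return num_params * bits_per_param / 8


# Hypothetical model: 70 billion parameters stored at 16 bits each.
params = 70_000_000_000
capacity = raw_capacity_bytes(params, 16)

# Hypothetical text corpus the model would have to "contain": 100 TB.
corpus_bytes = 100 * 10**12

# How many times larger the corpus is than the model's raw storage.
ratio = corpus_bytes / capacity

print(f"raw capacity: {capacity / 1e9:.0f} GB")
print(f"corpus is about {ratio:.0f}x larger than the weights")
```

Even under these generous assumptions the weights are hundreds of times smaller than the text, so verbatim recall of every answer is off the table; retrieval at query time sidesteps that limit.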
While I agree with what you’re saying, the typical AI agent doesn’t say “I’m not totally sure about this, should I search the web?” It often just spits out a reply based on its knowledge.
Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant - 42%.
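For anyone wanting to run the same kind of comparison, a minimal sketch of a pairwise disagreement rate follows. It assumes each model assigns one label per claim and that disagreement simply means the two labels differ; the exact definition behind the 42% figure is not given here, and the function name and the sample labels below are hypothetical.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of claims on which two classifiers gave different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same set of claims")
    if not labels_a:
        raise ValueError("need at least one claim")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


# Hypothetical labels for five claims (not data from the test being discussed).
model_a = ["true", "false", "mixed", "true", "false"]
model_b = ["true", "mixed", "mixed", "false", "false"]

print(disagreement_rate(model_a, model_b))
```

A raw rate like this does not correct for agreement that would happen by chance, so it is worth reading alongside the label distribution.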
Can you share what you mean by this?
> when using dedicated AI resources that I'm paying for
Are there API-based search providers that structure their results differently?