Here's the prompt they used: Classify this claim as of <date>: "<a...

simonw • today at 12:46 PM • 39 replies

Here's the prompt they used:

  Classify this claim as of <date>: "<atomic claim>"

  Output exactly one label: True,
  Mostly True, Misleading, or False.
  No explanations, no qualifiers.

The claims look like this: https://lenz.io/research/llm-disagreement/data.csv

I put that in Datasette Lite to make it easier to explore. Here's an example of a disagreement: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

The claim was "All almonds are grown in the U.S. state of California.". All but one model said False, Opus 4.7 said "misleading".

I feel like having "mostly true" and "misleading" in there weakens the story, especially given the "no explanations" rule in the prompt.

The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

[ Update: OK, this almond thing was a bad example and I regret picking it. Read on for better ones. ]

The prompt lacks any kind of rubric to clarify how those terms should be applied.
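To make the missing-rubric point concrete, here is a minimal sketch of a prompt builder that spells out each label and adds an escape hatch for claims that can't be checked. The label definitions, the "Unverifiable" option, the `build_prompt` name, and the example date are all assumptions for illustration; none of them come from the study.

```python
# Hypothetical rubric: these definitions are illustrative, not from the study.
RUBRIC = {
    "True": "Accurate as stated, with no material omissions.",
    "Mostly True": "Core claim is accurate; a minor detail is wrong or missing.",
    "Misleading": "Literally defensible, but likely to leave a false impression.",
    "False": "Core claim is inaccurate.",
    "Unverifiable": "Cannot be judged from available knowledge as of the date.",
}


def build_prompt(claim: str, as_of: str) -> str:
    """Render a classification prompt that defines every label it allows."""
    lines = [f'Classify this claim as of {as_of}: "{claim}"', "", "Labels:"]
    lines += [f"- {label}: {definition}" for label, definition in RUBRIC.items()]
    lines += ["", "Output exactly one label from the list above."]
    return "\n".join(lines)


# The date below is a placeholder, not a value taken from the dataset.
print(build_prompt("All almonds are grown in the U.S. state of California.", "2026-05-20"))
```

Whether these particular definitions are the right ones is a judgement call; the point is only that each model would then be applying the same written definitions rather than its own.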

As is so often the case with this kind of study, it's an evaluation of the prompt and harness used by the study in addition to being an evaluation of the underlying models.

Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.

Update 2: a much better example:

"On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"

The only correct answer to that, if you don't have a search tool, is "this claim is impossible for me to verify". And that wasn't an option.

The answers were split between true and false: https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

Replies

harpastum • today at 12:54 PM

Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.

Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?

This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
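A rough way to separate disagreement about wording from disagreement about facts is to compute agreement twice: once on the raw labels and once after collapsing them into two coarse buckets. The sketch below does that on toy data. The `COARSE` mapping is an assumption (it puts "Misleading" on the false-leaning side, which is itself debatable), and the function name and sample labels are made up for illustration.

```python
from itertools import combinations

# Assumed mapping: "Mostly True" leans true, "Misleading" leans false.
COARSE = {
    "True": "true-ish",
    "Mostly True": "true-ish",
    "Misleading": "false-ish",
    "False": "false-ish",
}


def pairwise_agreement(labels_by_claim, collapse=False):
    """Fraction of model pairs, across all claims, that gave the same label."""
    same = total = 0
    for labels in labels_by_claim:
        if collapse:
            labels = [COARSE[label] for label in labels]
        for a, b in combinations(labels, 2):
            same += a == b
            total += 1
    return same / total if total else 0.0


# Toy data: one inner list per claim, one label per model.
toy = [
    ["True", "Mostly True", "True"],
    ["False", "Misleading", "False"],
    ["True", "False", "True"],
]
print(pairwise_agreement(toy))                 # strict label match
print(pairwise_agreement(toy, collapse=True))  # coarse true-ish / false-ish match
```

On the toy data the strict score is 3 of 9 pairs and the collapsed score is 7 of 9, so most of the apparent disagreement there comes from label granularity rather than from the underlying fact.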

theptip • today at 3:11 PM

Another (IMO fatal) error is they don’t attempt to measure within-model variance.

The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism.

They really should be breaking out the numbers for the 3 without search (kinda meaningless for recent factual claims after knowledge cutoff) vs search agents. Lack of an “I don’t know” option completely invalidates results for the non-search models; they are basically guessing what seems like a probable answer, since they don’t know and aren’t allowed to say that.

I do agree the forced choice and “weak / strong” variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL examples to illustrate what you mean by “mostly” instead of leaving this to the model to define.
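Within-model variance could be estimated by asking the same model the same claim several times and checking how often it lands on its own most common answer. The sketch below assumes you already have those repeated labels; the function names and the toy runs are hypothetical, and nothing here calls a real model API.

```python
from collections import Counter


def self_consistency(runs):
    """Share of repeated runs that match the most common label for one claim."""
    counts = Counter(runs)
    return counts.most_common(1)[0][1] / len(runs)


def mean_self_consistency(runs_by_claim):
    """Average self-consistency across claims for a single model."""
    return sum(self_consistency(r) for r in runs_by_claim) / len(runs_by_claim)


# Toy data: one inner list per claim, one label per repeated run of the same model.
toy_runs = [
    ["True", "True", "True", "True"],
    ["False", "Misleading", "False", "False"],
    ["True", "False", "True", "False"],
]
print(mean_self_consistency(toy_runs))
```

If a model only agrees with itself on, say, three runs out of four, then cross-model disagreement at a similar rate says little about the models differing from each other.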

faxmeyourcode • today at 2:59 PM

I had a hunch that opus 4.7 hedged more than other models - and it turns out it's true

    model                 total_claims  hedged_count  hedged_pct
    claude-opus-4-7       1000          451           45.1
    sonar-pro             1000          391           39.1
    gpt-5.4               1000          277           27.7
    gemini-3-retrieval    1000          129           12.9
    gemini-3-pro          1000          60            6.0

datasette query here

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
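For anyone who would rather compute this outside Datasette, here is a sketch of the same kind of tally in plain Python. It assumes "hedged" means a verdict of "Mostly True" or "Misleading", and it assumes a long-format CSV with `model` and `label` columns; the real file may be laid out differently, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict

HEDGED = {"mostly true", "misleading"}  # assumed definition of "hedged"


def hedged_rates(rows):
    """Return {model: (total, hedged_count, hedged_pct)} from model/label rows."""
    totals = defaultdict(int)
    hedged = defaultdict(int)
    for row in rows:
        totals[row["model"]] += 1
        if row["label"].strip().lower() in HEDGED:
            hedged[row["model"]] += 1
    return {m: (n, hedged[m], round(100 * hedged[m] / n, 1)) for m, n in totals.items()}


# Invented sample in a hypothetical long format (one row per model per claim).
sample = io.StringIO(
    "model,label\n"
    "model-a,True\n"
    "model-a,Mostly True\n"
    "model-a,misleading\n"
    "model-a,False\n"
    "model-b,True\n"
    "model-b,False\n"
)
rates = hedged_rates(csv.DictReader(sample))
print(rates)
```

Labels are lowercased before comparison because the thread itself quotes verdicts in mixed case ("misleading" vs "Misleading").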

feanaro • today at 1:23 PM

> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer?

Am I missing something?

[1]: https://en.wikipedia.org/wiki/Almond#Production

parsimo2010 • today at 2:12 PM

This is a great example of why prompt engineering is still relevant. Without providing definitions and examples and a well defined rubric, you’re going to see different models disagree by a level in either direction. When you get more prescriptive the models tend to agree better.

I’ve experimented with AI grading for undergraduate math courses, and see basically the same thing. If you just tell the AI “grade this problem and assign a letter grade” then I’ve only seen about 30% agreement between a human assigned grade and the AI assigned grade. But over 75% agreement if you say a “match” is within one letter grade. And to get better agreement you have to spend a lot more time on the rubric- what kinds of mistakes are a big deal, what kinds of mistakes are not a big deal, how much work is required to be shown to get credit, a couple examples of each letter grade. Once you have done that, the AI gets a lot better agreement with human graders, but it is hard to know when you’ve given enough guidance for a problem.
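The gap between exact agreement and "within one letter grade" agreement can be computed with a tolerance parameter on an ordinal scale. This is a sketch with made-up grades, not the data behind the 30% and 75% figures above; the `agreement` function and the five-letter scale are assumptions.

```python
GRADES = ["A", "B", "C", "D", "F"]            # assumed ordinal scale, best to worst
RANK = {grade: i for i, grade in enumerate(GRADES)}


def agreement(human, ai, tolerance=0):
    """Fraction of items where the two grades differ by at most `tolerance` steps."""
    matches = sum(abs(RANK[h] - RANK[a]) <= tolerance for h, a in zip(human, ai))
    return matches / len(human)


# Invented grades for five problems.
human = ["A", "B", "C", "D", "B"]
ai = ["A", "C", "C", "F", "D"]

print(agreement(human, ai))               # exact match
print(agreement(human, ai, tolerance=1))  # within one letter grade
```

The same idea carries over to the four verdict labels if you are willing to treat them as an ordered scale, which is exactly the assumption several comments here dispute for "Misleading".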

gbuk2013 • today at 3:06 PM

An interesting tangent on this is: how many answers to these (or any number of factual questions) do you (as in anyone) actually know. Not believe you know, but actually know.

Knowing something is different to reading about something, or hearing something from someone. And yet this is often confused as knowledge. In this way are we all that different from AI - we have some data and we regurgitate it as knowledge. Bad data, wrong answer. Except humans can also throw in some emotion to really muddle things up. :)

jerf • today at 1:04 PM

This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, and they only have so much room for "information storage" in them, and there's a lot more things they need to fit into their numbers.

This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hand of a human, it would probably get a lot closer to something humanlike.

I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.

brokensegue • today at 2:23 PM

yeah i really don't like the corpus of statements and it makes me doubt lenz. consider

> “Artificial intelligence will cause widespread job loss among software engineers.”

https://lenz.io/c/ai-software-engineers-job-loss-impact-05e4...

this is a statement about the future. who knows? dataset also includes

> Robots will not replace human teachers in schools in the near future.

> Papua New Guinea has very few female members of parliament.

what counts as very few?

> “Taurine supplementation supports mood and emotional health in humans.”

why is this labeled as misleading? i'm not even sure when I'm supposed to use the misleading label

> Anaximander was the first scientist in recorded history.

this is a judgement call as the term scientist didn't exist.

the claims that feel actually solidly answerable seem to have much better LLM performance

ashirviskas • today at 2:45 PM

I created this sheet to get proper model accuracy using the lenz data, check it out.

Note: It may still not be a perfectly accurate representation of truth as it uses user submitted data. I also used AI to build the sheet.

https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...

xyzzy123 • today at 2:07 PM

> "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"

I actually don't know which way you came down on that one?

I think strictly it's false but "mostly true" would be justifiable? (as in, to say it's false would be misleading if it led the reader to assume there was no attack around that time).

https://www.washingtonpost.com/world/2026/05/17/ukrainian-dr...

It seems it happened Saturday 16th overnight into the 17th, not the 18th. I see this a LOT with fact checking. It shouldn't be this way, but political bias seems to nudge people into making calls land one way or the other with selective application of pedantry.

coldtea • today at 2:13 PM

> Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

> The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.

So the models were right? The actual criterion should be whether "Incomplete Egypt visa application forms" are indeed "among the most common reasons" or not.

That "true" and "mostly true" means effectively the same thing is irrelevant. It could just as well trip me up, and I'm a human. If somebody told me either answer, I'd still consider them right if the basic fact was right.

vjvjvjvjghv • today at 1:38 PM

"Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers."

That's exactly the stupidity of the public discourse these days. People feel compelled to take a clear position although there is much more subtlety in many issues. It's not ok to say "I don't know", "it depends" or "as far as I know". And then people feel they need to defend this position no matter what new information comes up.

segmondy • today at 1:18 PM

Yup, if anything this should be a guide on how not to eval a model. Furthermore, let's say the labels were non ambiguous, why would we care about alignment between the models? The only number I would personally care about is percentage of correct answers so I know which models to pick. I reckon with clear and non ambiguous prompts that we would see huge agreement if not 100% on real world facts. The huge models are scary good in their world knowledge.

hombre_fatal • today at 1:55 PM

Yeah, scrolling through the examples, you have no idea where the models actually disagree on the underlying facts when it's just "X vs Mostly X" or "Mostly X vs Misleading" or "False vs Misleading". Or even True vs False -- without seeing the explanation, then I cannot necessarily compare two answers.

The study is about whether they said the same phrase which is a much weaker claim than people in the comments are reacting to.

Reminds me of this professor I had who thought it was epic to always respond to our questions with "it depends" before hashing out two very different but technically correct answers. It was obnoxious and he saw it as his tag line, but he had a point about nuance.

roxolotl • today at 1:25 PM

If we’re going to use LLMs as oracles I don’t think the prompt is unreasonable. They are being sold as geniuses and people are treating them as such especially given the characterization of AI in science fiction as overly correct. A perfect tool that has ”genius level intelligence” would answer correctly.

neversupervised • today at 2:23 PM

This is not how people use LLMs. If you ask one of these questions you’d get a longer answer, often grounded on the internet. I speculate that conditional on a smart human operator interpreting the results, such interpretations across vendors converge more often than this report makes it seem.

moritzwarhier • today at 2:12 PM

The examples seem intentionally diverse, but I haven't seen one that I would be surprised for someone to post about in the format of "ChatGPT/Gemini/Claude/Qwen/... says:"

So the examples are good, I think. The rest is philosophy.

The links you posted only show a frozen loading spinner for me (iOS Safari).

(I looked at the csv in Numbers instead)

kostaj • today at 12:57 PM

Used "No explanations, no qualifiers." to force the models to answer only with one of the four labels. It's worth running a separate test with more explanation in the prompt on how to classify between the four buckets.

post-it • today at 1:52 PM

Fwiw the two models that did have access to search disagreed with each other on the bombing one:

> 7.1 Model selection

> Five frontier models, chosen to cover two capability surfaces:

> Parametric (training-only): GPT-5.4 (OpenAI), Claude Opus 4.7 (Anthropic), Gemini 3 Pro (Google)

> Retrieval-augmented: Gemini 3 Pro + Search (Google), Sonar Pro (Perplexity)

skrebbel • today at 1:21 PM

I really struggle to believe that this was just a little oopsie. I flagged the article, it seems more misleading than the average Claude hallucination.

anilgulecha • today at 2:01 PM

Disagree is such a loose/wimpy study. Add in a grounded/expected response, and then it becomes a better benchmark (because it'll force the author to actually think about choices presented to the LLM).

wrsh07 • today at 1:16 PM

Thanks for the links and digging! It's an interesting question, but the methodology has serious problems, and it would be more interesting to me if they allowed models to provide justification.

I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another
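As a sketch of that idea, the reply could be requested as JSON with the one-word verdict in one field and the reasoning in another, then validated on the way in. The field names (`label`, `rationale`), the `Verdict` class, and the sample reply are hypothetical; this only shows the parsing side and does not call any vendor's structured-output API.

```python
import json
from dataclasses import dataclass

ALLOWED = {"True", "Mostly True", "Misleading", "False"}


@dataclass
class Verdict:
    label: str      # exactly one of ALLOWED
    rationale: str  # free-text justification, may be empty


def parse_verdict(raw: str) -> Verdict:
    """Parse a JSON reply and reject any label outside the allowed set."""
    data = json.loads(raw)
    label = data["label"]
    if label not in ALLOWED:
        raise ValueError(f"unexpected label: {label!r}")
    return Verdict(label=label, rationale=data.get("rationale", ""))


# Hypothetical model reply, written by hand for this example.
reply = '{"label": "False", "rationale": "Almonds are also grown outside California."}'
verdict = parse_verdict(reply)
print(verdict.label)
print(verdict.rationale)
```

Keeping the label in its own field means the headline agreement numbers can still be computed the same way, while the rationale shows whether two models that picked different labels actually disagree on the facts.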

singpolyma3 • today at 12:55 PM

False vs misleading doesn't seem like a disagreement?

andai • today at 1:02 PM

Thanks. The first link is a spreadsheet. Here's a web-readable version.

https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...

jstummbillig • today at 1:14 PM

It's all fairly lazy to a degree that is mildly confusing. I also feel this among other issues would have become obvious if they had bothered to include a human fact checker baseline (i.e. asked multiple human fact checkers the same questions).

Someone • today at 1:22 PM

For those questions, it wouldn’t surprise me at all if five well-educated intelligent humans disagreed on over two out of three of them.

I would answer “don’t know” on many, but that’s not an option.

WhitneyLand • today at 1:28 PM

So in other words if the research had tried to assign a severity to the mistakes models made the entire paper may collapse as uninteresting?

nullsex • today at 2:30 PM

Why are you bending over backwards this much to make results appear better than they are?

The article might be a bit sensationalistic, rigour could be better and the data might have flukes... But your comment is overcorrecting and nitpicking framed as analysis.

I get the same feeling in several of your posts recently.

Same with persisting to showcase the pelican-on-a-bicycle as a useful sample when it's obviously trained on and for, for those very posts. It stopped being cute last year.

Are you being paid or do you have shares? You'd get the attention whichever angle you put here. These corporates don't need you defending them. Humanity might need you however.

malfist • today at 12:57 PM

> All almonds are grown in the U.S. state of California

This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.

j45 • today at 2:53 PM

I feel like the prompting could be tweaked to improve response.

Models often have a reasoning/thinking/research mode that is triggered by asking slightly differently.

Still though, Gemini can be a little weak on this front by default but can be aligned to behave better.

nonethewiser • today at 2:52 PM

Misleading is not analogous with True or False.

Depending on the question, True or False can be objectively right/wrong. Misleading is going to be a judgement call.

This is the inherent problem with "fact checking." It's hard to be completely objective. Even when the question has an objective answer, simply choosing where to look and what facts to verify is itself a bias. Looking at this instead of that, or looking at this but not also this other thing that adds context, etc.

Frankly I think disagreeing often is the expected outcome. Fact checking is just kinda bullshit. It's spin dressed up as objectivity. I hope people remember that "fact checking" is a relatively modern thing.

Forgeties79 • today at 12:59 PM

I really don’t buy the almond explanation you’re giving. That requires the level of logic a kindergartener has. It’s a very simple all or nothing question.

If LLM’s are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”

camillomiller • today at 12:59 PM

>> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any

kordlessagain • today at 12:59 PM

Give a model a crawler tool (like Grub.nuts.services) and your "problem" goes away.

tosh • today at 12:55 PM

ty for digging this up, appreciate the time saving

dfxm12 • today at 1:27 PM

> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.

empath75 • today at 2:09 PM

[dead]

johnbarron • today at 1:09 PM

Your reply would have more credibility if, instead of commenting on this 25 min after it was posted just to nitpick some of the questions, you had tried to reproduce the research.

As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?

jannyfer • today at 12:58 PM

Thank you, my eyes glazed over when I saw the article was written with AI.
