Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.
Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?
This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.
If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.
> True / Mostly True / Misleading / False
> Which category should something go in if it's "mostly false"?
For some reason they have chosen to call that "Misleading" rather than a more symmetrical "Mostly False", but the intent seems clear enough.
> I guess the goal is to test the models and not the harness
Less important than the harness are the system/user prompts themselves (which, of course, are put in the harness), and those are effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more similar to each other, as the biggest/best models have more or less identical, strong prompt adherence in my experience.
But do you think the article is
A) false
B) misleading
> Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?
Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.
Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.
This does not invalidate your point though. Things can be true and misleading.
> Something can be simultaneously "misleading" and either true or false.
Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.
But the models are more intelligent than humans already and sentient beings, right? So they shall know the meanings innately. So, you don’t need to explain to them what they mean.
You may give them better instructions, but they should already have the intellect to understand the assignment.
Right, right?
Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.
I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified.
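To make that reading concrete, here is a rough Python sketch of the four labels as two separate yes/no judgments. The two axes and the names `Verdict`, `verdict`, `literally_true` and `gap_matters` are my own assumptions for illustration; the article does not specify anything like this.

```python
from enum import Enum


class Verdict(Enum):
    TRUE = "True"
    MOSTLY_TRUE = "Mostly True"
    MISLEADING = "Misleading"
    FALSE = "False"


def verdict(literally_true: bool, gap_matters: bool) -> Verdict:
    """Map two independent judgments onto the four labels (hypothetical scheme).

    literally_true: is the claim factually correct as worded?
    gap_matters:    does the distance between the wording and the full
                    picture change what a reader would take away?
    """
    if literally_true:
        # True as worded: either no gotchas, or major gotchas.
        return Verdict.MISLEADING if gap_matters else Verdict.TRUE
    # Not true as worded: either close enough, or not even close.
    return Verdict.FALSE if gap_matters else Verdict.MOSTLY_TRUE


print(verdict(literally_true=True, gap_matters=True).value)
```

Under this reading every claim lands in exactly one bucket, but only because each rater has been told which two questions to answer.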