Actually, do they do this in LLM benchmarks, as a measure of overconfidence/confabulation? Seems immediately applicable.