> barely competitive
It scores higher than every other model, with the one exception of Gemini 3.1 Pro on MMMLU.
MMMLU is generally considered maxed out, since it may not be possible to score much higher than the current top results.
> Overall, they estimated that 6.5% of questions in MMLU contained an error, suggesting the maximum attainable score was significantly below 100%[1]
Other models come close on GPQA Diamond, but it wouldn't surprise anyone if the maximum attainable score there were around the 95% the top models are already reaching.