So do humans when asked to answer tests. The appropriate comparison is human performance at the same task.
At most of these comprehension tasks, AI is already superhuman (in part because Gary picked scaled tasks that humans are surprisingly bad at).
You can't really compare to human performance because the failure modes and performance characteristics are so different.
In some instances you'll get results that are shockingly good (and in no time); in others you'll have a grueling experience going in circles over fundamental reasoning, the kind of exchange you'd probably fire a person on the spot for.
And there's no learning between sessions and no subject-area mastery - results on the same topic can vary even within the same session (with the relevant context included).
So if something is superhuman part of the time and subhuman a large percentage of the time, with no good way of telling which you'll get or when - the result isn't the average if you're trying to use it as a tool.
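To make that concrete, here's a toy sketch (entirely hypothetical numbers, just to illustrate the point): two assistants with the same average score, where one is usable every time and the other only about half the time - and you can't tell which half in advance.

```python
import random

random.seed(0)

def consistent_assistant():
    # Always delivers a mediocre but usable result (hypothetical score).
    return 0.6

def bimodal_assistant():
    # Half the time brilliant, half the time useless - and you can't
    # tell which you'll get in advance (hypothetical scores).
    return 1.0 if random.random() < 0.5 else 0.2

def usable(score, threshold=0.5):
    # Below this (assumed) threshold, checking and fixing the result
    # costs more than it saves.
    return score >= threshold

N = 10_000
for name, fn in [("consistent", consistent_assistant), ("bimodal", bimodal_assistant)]:
    scores = [fn() for _ in range(N)]
    avg = sum(scores) / N
    frac = sum(usable(s) for s in scores) / N
    print(f"{name}: average score {avg:.2f}, usable results {frac:.0%}")
```

Both come out around the same average, but as a tool one of them is only worth reaching for about half the time - which is the point: the average hides exactly the thing that determines whether you can rely on it.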