You can't really compare it to human performance because the failure modes and performance characteristics are so different.
In some instances you'll get results that are shockingly good (and almost instant); in others you'll have a grueling experience going in circles over fundamental reasoning, the kind of exchange where you'd fire a person on the spot.
And there's no learning between sessions and no accumulating subject-matter mastery - results on the same topic can vary even within the same session, with the relevant context included.
So if something is superhuman a large percentage of the time and subhuman the rest, with no good way of telling in advance which you'll get, the practical result isn't the average - you have to treat every output as potentially the bad case and verify it accordingly.