That’s what makes it a fair evaluation, and an area that needs improvement. We shouldn’t evaluate agent skills only by what is most commonly represented in training data; we expect performance in areas where existing training data may be deficient. You don’t need to invent an absurdity to find such cases.
It's reasonable to test their ability to do this, and it's worth working to make it better.
The issue is that people claim this performance is representative of a human’s performance in the same situation, which gives an incorrect overall estimate of ability.