It's reasonable to test their ability to do this, and it's worth working to make it better.
The issue is that people claim the performance is representative of how a human would perform in the same situation. That gives an incorrect overall estimate of ability.