It makes perfect sense to use human completion times as the baseline: otherwise the test would be biased toward models with slower inference.
If model A generates 10 tokens per second and model B generates 100 tokens per second, then scoring by real LLM inference time hands A a massive 10x advantage, all else being equal: the same output takes A ten times as long to produce, so A appears to be completing tasks ten times as long.
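A toy calculation makes the 10x concrete (the token count and human time below are hypothetical numbers, chosen only for illustration): both models emit the same solution, but scoring by inference time inflates the slower model's apparent task length, while a human baseline scores them identically.

```python
TOKENS = 6000        # assumed length of the solution, identical for both models
HUMAN_TIME_S = 300   # assumed human completion time for the same task

for name, tokens_per_s in [("A", 10), ("B", 100)]:
    # Wall-clock time the model spends generating the solution
    inference_time_s = TOKENS / tokens_per_s
    print(f"model {name}: {inference_time_s:.0f}s of inference time, "
          f"vs. fixed human baseline of {HUMAN_TIME_S}s")
```

Model A spends 600 s and model B 60 s on identical work, so inference-time scoring credits A with a task 10x longer; the human baseline assigns both the same 300 s task.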