
ACCount37, today at 6:46 AM

It makes perfect sense to use human times as a baseline, because otherwise the test would be biased towards models with slower inference.

If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things equal: A spends 10x longer on the same task, so it gets credited with a task that is 10x "longer".
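To put rough numbers on that, here is a minimal sketch of the arithmetic. The 6,000-token task size and the helper name `wall_clock_seconds` are made up for illustration; only the 10 and 100 tokens-per-second rates come from the example above. It also assumes both models need the same number of tokens to solve the task.

```python
def wall_clock_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time a model spends generating `tokens` at a fixed decoding rate."""
    return tokens / tokens_per_second


# Hypothetical task: both models emit the same number of tokens to solve it.
TASK_TOKENS = 6_000

time_a = wall_clock_seconds(TASK_TOKENS, 10)   # model A: 10 tokens/s
time_b = wall_clock_seconds(TASK_TOKENS, 100)  # model B: 100 tokens/s

# If "task length" were measured in model wall-clock time, the same task
# would count as this many times longer for the slower model.
ratio = time_a / time_b

print(f"A: {time_a:.0f} s, B: {time_b:.0f} s, ratio: {ratio:.0f}x")
```

Same task, same output, but A would be scored as having completed a 10-minute task and B a 1-minute one. A human baseline gives both models the same task length regardless of how fast they decode.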