I appreciate horizon expansion as a fundamental metric, but duration seems like too crude a measure. We used to like it when computers were fast.
An infinitely unscrupulous model provider could double this five hour result by cutting your output tokens/second in half!
This isn't only a question of gaming the metric: the very strong current small, fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against this; they will succeed or fail much sooner simply because they run faster.
How about something like total output token count as the "long term horizon" metric instead?
Task duration is the time it would take a human to complete the task. How fast the model runs, and how long it might take to complete the task itself, is not part of this metric.
The time (horizon) here is not the model's time to complete the task, but a human's.
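To make that concrete, here's a minimal sketch of how such a human-time horizon can be computed, assuming a METR-style approach of fitting a logistic success curve over the log of human baseline time. The function name, toy data, and fitting loop are all my own illustrative assumptions, not METR's actual code. The point to notice is that the model's tokens/second appears nowhere: only pass/fail outcomes and human baseline minutes enter the calculation.

```python
import math

def time_horizon(results, p_target=0.5):
    """results: list of (human_minutes, passed) pairs.

    Fits success probability vs. log(human time) with a tiny logistic
    regression (plain gradient descent), then returns the human time at
    which the fitted success rate equals p_target.
    """
    xs = [math.log(t) for t, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]
    a, b = 0.0, 0.0          # intercept, slope of the logistic fit
    for _ in range(5000):    # crude but plenty for a toy example
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += p - y
            gb += (p - y) * x
        a -= 0.1 * ga / len(xs)
        b -= 0.1 * gb / len(xs)
    # Solve a + b*x = logit(p_target) for x, then undo the log.
    x_star = (math.log(p_target / (1.0 - p_target)) - a) / b
    return math.exp(x_star)

# Toy data: a model that clears short tasks and fails long ones.
# Only human baseline minutes and pass/fail enter; model speed is absent.
tasks = [(2, True), (5, True), (15, True), (60, False), (240, False)]
print(f"50% horizon: ~{time_horizon(tasks):.0f} human-minutes")
```

So halving a model's output tokens/second changes nothing here; the only way to move the horizon is to pass tasks that take humans longer.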