dwohnitmok • today at 5:39 AM

To be clear, this doesn't mean that it takes the AI > 4 hours to do the task. METR measures the difficulty of a task by how long it takes a human to do the same task. This benchmark is saying that Opus 4.5 can now do tasks (related to AI R&D, coding foremost among them) that take human experts > 4 hours (at a 50% reliability level; whether that's actually useful depends, of course, on the cost of failure). It is silent on how long AI systems take to do those tasks. In theory an AI system could take longer than that (in practice it's usually significantly shorter).
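
For anyone who wants the "50% reliability" part made concrete, here is a rough sketch. It assumes success probability follows a logistic curve in the log of human task time, which is my simplified reading of how a time horizon gets defined, not METR's actual code. The function names and the parameters `A` and `B` are made up, chosen only so that the 50% horizon lands on exactly 4 hours.

```python
import math

def success_probability(human_minutes, a, b):
    """Hypothetical logistic model: success gets less likely as human time grows."""
    return 1.0 / (1.0 + math.exp(-(a - b * math.log2(human_minutes))))

def time_horizon(reliability, a, b):
    """Human task length (in minutes) at which the model succeeds with the given probability."""
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((a - logit) / b)

# Made-up parameters (not fitted to any real data), picked so the
# 50% horizon is 240 human-minutes, i.e. 4 hours.
B = 0.6
A = B * math.log2(240)

h50 = time_horizon(0.5, A, B)  # human-minutes at 50% success
h80 = time_horizon(0.8, A, B)  # human-minutes at 80% success (shorter)

print(f"50% horizon: {h50:.0f} human-minutes")
print(f"80% horizon: {h80:.0f} human-minutes")
```

Note that every quantity here is human time. Nothing in the calculation involves how long the AI itself ran, and asking for higher reliability gives a shorter horizon on the same curve.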

This is of course quite highly correlated with an AI system being able to churn through a task for a long time. But it's not necessarily the same thing.

Of course the big questions are going to arise if/when we start passing lines like 8 hours (a whole work day) or 40 hours (a whole work week).
