Hacker News

twotwotwo · today at 5:44 AM

METR is measuring hours of equivalent human effort, not the actual hours the agent itself spends, so by their methodology your task might qualify as one where the agent pulls off much more than 4h of human work.

"Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend.

In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you.


Replies

TobiasJBeers · today at 7:19 AM

The “50% time horizon” feels most actionable when you pair it with an expected-value model. For a given task: EV ≈ (human_time_saved × $/hour) − (p_fail × cost_of_failure) − (iteration/oversight cost). A model crossing 4h-at-50% might be hugely useful for low failure-cost work, and still net-negative for anything where rollback/debug is expensive. The missing piece is how p_fail scales with task length + how recoverable failures are.
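A minimal sketch of that calculation in Python (the function name and all dollar figures are made up for illustration; none of this comes from METR):

    def task_ev(human_time_saved_h, hourly_rate, p_fail, cost_of_failure, oversight_cost):
        # EV ~ value of time saved, minus expected failure cost,
        # minus the cost of reviewing / iterating on the agent's output.
        return (human_time_saved_h * hourly_rate
                - p_fail * cost_of_failure
                - oversight_cost)

    # Same 4h-at-50% task under two failure-cost regimes:
    print(task_ev(4, 100, p_fail=0.5, cost_of_failure=50, oversight_cost=50))     # 325.0, net positive
    print(task_ev(4, 100, p_fail=0.5, cost_of_failure=2000, oversight_cost=100))  # -700.0, net negative

The identical time saved flips from positive to negative EV purely on how expensive failure and rollback are.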

nightshift1 · today at 7:43 AM

>which human

The second graph has this under it:

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years...
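A back-of-the-envelope reading of that trend (the starting horizon below is an illustrative assumption, not METR's data): a 7-month doubling time means the 50%-reliability horizon grows by a factor of 2^(months/7).

    def horizon_hours(months_from_now, start_hours=4.0):
        # 7-month doubling: the horizon multiplies by 2 every 7 months.
        return start_hours * 2 ** (months_from_now / 7)

    print(horizon_hours(12))  # ~13.1 h one year out
    print(horizon_hours(24))  # ~43.1 h two years out

Over the quoted 6 years (~72 months) that is roughly ten doublings, i.e. about a thousandfold longer horizon.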
