Hacker News

Measuring AI Ability to Complete Long Tasks

200 points by spicypete today at 4:06 AM | 150 comments | view on HN

Comments

yoan9224 today at 5:42 PM

The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.

What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.

This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.

The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?

show 2 replies
subdavis today at 4:41 AM

I recently asked Opus to just “Add vector search” to my current hobby project, a topic I know very little about. It set up manticore, pulled an embedding model, wrote a migration tool for my old keyword indices, and built the front end. I’m not exaggerating much either: the prompt was the length of a tweet.

I think it would easily have taken me 4+ hours to do that. It ran in 15 minutes while I played Kirby Air Riders and worked on the first try.

Afterward, I sort of had to reflect on the fact that I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature. It kept me learning the thing I cared about rather than doing a side quest.

show 6 replies
simonw today at 4:50 AM

I didn't really understand the "long task" thing until I actually experienced it. The problem is finding a task to set an agent on that justifies it working for that long. I finally hit one when I tried porting that Python HTML5 parser to JavaScript by pointing Codex CLI at the html5lib-tests suite of 9,200 tests: https://simonwillison.net/2025/Dec/15/porting-justhtml/

It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute-force.

show 8 replies
0x000xca0xfe today at 3:12 PM

After spending many hours optimizing some routines, I now think performance optimization is a great benchmark for identifying how generally smart an AI is at helping with a specific piece of code.

Solutions are quite easy to verify with differential testing and produce a number for direct comparison.

Less code is usually better and you generally can't "cheat" by adding more cruft so it nullifies the additive bias. Good optimization requires significant understanding of the underlying structures. Everything has performance tradeoffs so it requires systemic thinking and not just stringing independent pieces together.
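A minimal sketch of that kind of harness (the `baseline`/`optimized` functions are hypothetical stand-ins for the real routines; the point is just checking agreement on random inputs and reporting a single speedup number):

```python
import random
import time

def baseline(xs):
    # Hypothetical reference implementation: simple and obviously correct.
    return sorted(xs)

def optimized(xs):
    # Hypothetical candidate the model produced; must agree with baseline exactly.
    return sorted(xs)

def differential_test(trials=50, n=10_000, seed=0):
    rng = random.Random(seed)
    total_base = total_opt = 0.0
    for _ in range(trials):
        xs = [rng.randint(-10**6, 10**6) for _ in range(n)]
        t0 = time.perf_counter(); expected = baseline(xs); t1 = time.perf_counter()
        got = optimized(xs); t2 = time.perf_counter()
        assert got == expected, "optimized version disagrees with baseline"
        total_base += t1 - t0
        total_opt += t2 - t1
    return total_base / total_opt  # speedup factor: the single number to compare

if __name__ == "__main__":
    print(f"speedup vs baseline: {differential_test():.2f}x")
```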

So far I've found that Gemini Pro 3 was the best at reasoning about tricky SIMD code but the results with most models were pretty underwhelming.

twotwotwo today at 6:13 AM

I'm conflicted about opining on models: no individual has actually run a large enough sample of real-world tasks across a lot of models to speak with authority, but I kinda think we should each share our dubiously-informed opinions anyway, because benchmarks aren't necessarily representative of real-world use and many can clearly be gamed.

Anyhow, I noticed more of a difference trying Opus 4.5 compared to Sonnet 4.5 than I'd noticed from, for example, the last couple Sonnet bumps. Objectively, at 1.66x Sonnet's price instead of the old 5x, it's much more often practical to consider reaching for than past Opus models. Anthropic's basic monthly thing also covers a fair amount of futzing with it in CC.

At the other extreme, another surprise of this family is that Haiku 4.5 with reasoning on is usable: better than Sonnet with thinking off according to some benchmarks, and in any case subjectively decent for point edits, single-page thingies, and small tools.

pugio today at 5:00 AM

Opus looks like a big jump from the previous leader (GPT 5.1), but when you switch from "50%" to "80%", GPT 5.1 still leads by a good margin. I'm not sure if you can take much from this - perhaps "5.1 is more reliable at slightly shorter stuff, choose Opus if you're trying to push the frontier in task length".

show 1 reply
atleastoptimal today at 7:22 AM

They should do a 95% and 99% version of the graphs; otherwise it's hard to ascertain whether the failure cases will remain in the elusive category of "stuff humans can do easily but LLMs trip up on despite scaling".

zkmon today at 4:57 PM

> We believe this work has important implications ...
> First, our work demonstrates an approach ...

The Conclusions section is not for making a sales pitch for your article. It is for summarizing any new knowledge the article brings out.

rich_sasha today at 2:33 PM

How does "cost" per frontier task change with time?

Extrapolating any exponential growth is always dangerous, but over, say, 3 years at this pace, we'd go from 2 hours to 70, or about 8 days' work.
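Sanity-checking that extrapolation against the ~7-month doubling time the paper reports (the 2-hour starting point and 3-year window are the assumptions above):

```python
# Horizon after `months` of growth, given a starting horizon and a doubling time.
def horizon(start_hours: float, months: float, doubling_months: float = 7.0) -> float:
    return start_hours * 2 ** (months / doubling_months)

h = horizon(start_hours=2.0, months=36)                        # ~70.7 hours
print(f"{h:.0f} hours ≈ {h / 8:.1f} eight-hour working days")  # ≈ 8.8 days
```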

Quite scary. But what does cost do over the same timeline? Does it increase with computational complexity? Is it worse, because, IIRC, transformer computational cost is quadratic in context length? Or better, thanks to some kind of economies of scale?

I glanced through the article but couldn't find any info on this.

karimQuant today at 5:19 AM

The big issue is the 50% threshold: switch to 80% and the horizon is much shorter. And if you land on the wrong side of that 50% on a 4-hour task, how much time do you need on top of the 4 hours? Retrying, the chance of still not having it done is 50% * 50% -> 25% after two attempts and 50%^4 -> 6.25% after four. The cost of bad luck is very high.
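Back-of-envelope for that, assuming each retry is independent and starts from scratch (which ignores partial progress and review overhead):

```python
def p_still_failing(p_success: float, attempts: int) -> float:
    # Probability that every one of `attempts` independent tries fails.
    return (1 - p_success) ** attempts

def expected_hours(p_success: float, task_hours: float) -> float:
    # Expected number of tries is 1/p (geometric distribution), so expected cost scales with it.
    return task_hours / p_success

for k in (1, 2, 4):
    print(f"after {k} tries at 50%: {p_still_failing(0.5, k):.2%} chance of still failing")

print(f"expected cost of a 4-hour task at 50% reliability: {expected_hours(0.5, 4):.1f} h")
print(f"expected cost of a 4-hour task at 80% reliability: {expected_hours(0.8, 4):.1f} h")
```

So the same 4-hour task costs an expected ~8 hours at 50% reliability versus ~5 hours at 80%, before counting the time spent figuring out that an attempt failed.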

yismail today at 4:47 AM

Would be interesting to see Gemini 3.0 Pro benchmarked as well.

show 1 reply
grim_io today at 4:30 AM

This seems like a good way to measure LLM improvement.

It matches my personal feeling from using progressively better models over time.

NiloCK today at 9:33 AM

I appreciate horizon expansion as a fundamental metric, but duration seems like too crude a measure. We used to like it when computers were fast.

An infinitely unscrupulous model provider could double this five hour result by cutting your output tokens/second in half!

This isn't only a question of gaming the metric: the very strong current small-fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against this - they will succeed or fail much faster just because they are much faster.

How about something like total output token count as the "long term horizon" metric instead?
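A minimal sketch of how that could work, with made-up per-task records of (output tokens spent, success); fitting a logistic curve on log tokens and reading off the 50% crossing mirrors how the time horizon is derived from human task lengths:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task data: output tokens the model spent, and whether it succeeded.
tokens  = np.array([2e3, 5e3, 1e4, 3e4, 8e4, 2e5, 5e5, 1e6])
success = np.array([1,   1,   1,   1,   0,   1,   0,   0])

X = np.log10(tokens).reshape(-1, 1)
model = LogisticRegression().fit(X, success)

# The "token horizon": the token count at which predicted success crosses 50%.
x50 = -model.intercept_[0] / model.coef_[0][0]
print(f"50% token horizon ≈ {10 ** x50:,.0f} output tokens")
```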

show 2 replies
iLoveOncall today at 9:03 AM

> current models have almost 100% success rate on tasks taking humans less than 4 minutes

The contrary is easily verifiable by anyone individually. It's nowhere near 100%, or even 50%, for few-minute tasks, even with the best models in real-world situations.

show 1 reply
Aperocky today at 5:14 AM

I think the problem here is that an LLM eventually pollutes its context window with so much of the current task that the larger picture and architectural sanity are forgotten in favor of the work at hand.

And software is rarely one-and-done; after a few rounds like this, the architecture becomes schizophrenic. Combating this tendency usually requires throwing away a lot of the work from these "long tasks" and more closely limiting what the AI is trying to do as it happens. The success of one "long task" is not necessarily a good thing!

scotty79 today at 1:51 PM

> As shown above, when we fit a similar trend to just the 2024 and 2025 data, this shortens the estimate of when AI can complete month-long tasks with 50% reliability by about 2.5 years.

I don't think I have 50% success rate at month long tasks.

Anything that exceeds one day is pretty hard.

Davidzheng today at 9:01 AM

Big error bars, and the METR people are saying the longer end of the benchmark is less accurate right now. I think they mean this is a lower bound!

show 1 reply

bentobean today at 5:45 AM

> We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months.

If true, how much of this is a result of:

1. Genuine technical advancement

or:

2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?

In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will eventually disappear when the Ponzi scheme eventually collapses?

show 3 replies
nrhrjrjrjtntbt today at 5:20 AM

Why measure in minutes and not tokens? Seems like you could cheat by slowing the AI down.

show 1 reply
Dwedit today at 4:35 AM

Opus is already the name of an audio codec.

show 3 replies