> Our latest frontier models have shown particular strengths in their ability to do long-running ...

nikkwong • yesterday at 6:38 PM • 7 replies • view on HN

> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.

I have yet to see this (produce anything actually useful).

Replies

simonw • yesterday at 6:56 PM

How hard have you tried?

I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.

I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.

I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79

➕ show 2 replies

gamegoblin • yesterday at 6:50 PM

I routinely leave codex running for a few hours overnight to debug stuff

If you have a deterministic unit test that can reproduce the bug through your app front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc — it's an ideal usecase

➕ show 3 replies

XCSme • yesterday at 6:44 PM

Their ability to burn through tokens non-stop for hours, days or weeks without intervention.

➕ show 1 reply

TheMuenster • yesterday at 9:20 PM

Can I just say how funny this metric is?

"Our model is so slow and our tokens/second is so low that these tasks can take hours!" is not the advertising they think it is.

johnfn • yesterday at 7:28 PM

The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked in this for over 3 hours without intervention (I went to sleep). This is now in production.

➕ show 1 reply

wahnfrieden • yesterday at 7:01 PM

It worked for me several times.

It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring incompetent feeds of other users.

➕ show 1 reply

bitwize • yesterday at 7:38 PM

PEBKAC

alt Hacker News

Replies