IMHO, It's not the oneshotting. It's the "starting from empty slate" greenfiel...

rdsubhas • today at 11:29 AM • 2 replies

IMHO, It's not the oneshotting.

It's the "starting from empty slate" greenfield that's the real problem.

We used to make fun of engineers who follow a framework's README, test it on an empty project, and say "this framework is the best for our 10-year-running production app". Greenfield mentality is always the solution to all problems and the problem to all solutions.

One should still measure oneshotting; it's an important self-measurement metric. But measure it against an established, large codebase.

Replies

keheliya • today at 11:41 AM

There are upcoming benchmarks aimed at measuring the ability to work on brownfield tasks. (Of course, benchmarks can be gamed, but they are still better than the unrealistic toy tasks that earlier generations of benchmarks used. Frontier labs have yet to use them in their tech reports or marketing material, though. :-)

* SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios https://arxiv.org/abs/2512.18470
* SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration https://arxiv.org/abs/2603.03823

bluGill • today at 1:42 PM

At least they did some analysis. I've seen a couple of AI slop "X is the best tool for the job" recommendations where nobody even tried the tool. (Worse, we are already using Qt, which has a tool for the job, and the Qt tool works with the rest of the Qt ecosystem, unlike whatever the AI told them.)

