The quadratic curve makes sense, but honestly what kills us more is the review cost - AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep. We burn more time auditing AI output than we save on writing it, and that compounds. The API costs are predictable, at least.
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
If the abstraction the code uses is "right", there will be hardly any edge cases, and hardly anything to break three layers deep.
Even though I am clearly an AI-hater, for this very specific problem I don't see the root cause in the AI models, but in the programmers who don't care about code quality and thus don't brutally reject code that is not of exceptional quality.
To eliminate this tax I break anything gen-AI does into the smallest chunks possible.
> AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep
I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So frequent small PRs over big ones. Limit folder access to only the folders that need changing. Let it build the project. If it doesn't build, no PR submissions allowed. If a single test fails, no PR submissions allowed. And the tests will likely be the first, if not the main, focus in LLM PRs.
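Concretely, that gate could just be a script the agent has to pass before it's allowed to open a PR at all. A rough sketch, assuming a hypothetical project where only the billing folders are in scope for the task and make/pytest are the build and test commands (all of those are placeholders):

    # Hypothetical pre-submission gate for an agent branch. The allowed
    # directories and the build/test commands are placeholders.
    import subprocess
    import sys

    ALLOWED_DIRS = ("src/billing/", "tests/billing/")  # only what this task needs

    def changed_files():
        out = subprocess.run(
            ["git", "diff", "--name-only", "origin/main...HEAD"],
            capture_output=True, text=True, check=True,
        )
        return [f for f in out.stdout.splitlines() if f]

    def main():
        # Folder access: reject anything outside the whitelisted directories.
        outside = [f for f in changed_files() if not f.startswith(ALLOWED_DIRS)]
        if outside:
            sys.exit(f"No PR: changes outside allowed dirs: {outside}")

        # It has to build.
        if subprocess.run(["make", "build"]).returncode != 0:
            sys.exit("No PR: build failed")

        # A single failing test blocks submission.
        if subprocess.run(["pytest", "-q"]).returncode != 0:
            sys.exit("No PR: test suite failed")

        print("Gate passed: the agent may open a small PR for human review.")

    if __name__ == "__main__":
        main()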
I use the term "LLM" and not "AI" because I notice that people have started attributing LLM-related issues (like ripping off copyrighted material, excessive usage of natural resources, etc.) to AI in general, which is damaging for the future of AI.
What surprises me is that this obvious inefficiency isn't competed out of the market. I.e., this is clearly such a suboptimal use of time, and yet lots of companies do it and don't get competed out by the ones that don't.
I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.
Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.
Bugs that the AI agent would write, I would have also written. An example is unexpected data that doesn't match expectations. Can't fault the AI for those bugs.
I also find that the AI writes more bug-free code than I did. It handles cases that I wouldn't have thought of. It uses best practices more often than I did.
Maybe I was a bad dev before LLMs, but I find myself producing better-quality applications much more quickly.
The review cost problem is really an observability problem in disguise.
You shouldn't need to read every line. You should have test coverage, type checking, and integration tests that catch the edge cases automatically. If an AI agent generates code that passes your existing test suite, linter, and type checker, you've reduced the review surface to "does this do what I asked" rather than "did it break something."
The teams I've seen succeed with coding agents treat them like a junior dev with commit access gated behind CI. The agent proposes, CI validates, human reviews intent not implementation. The ones struggling are the ones doing code review line-by-line on AI output, which defeats the purpose entirely.
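For what it's worth, the "CI validates" half of that can be one script the agent's branch has to clear before a human ever looks at it. A minimal sketch, assuming a Python project that happens to use mypy, ruff, and pytest; the tool choices and test paths are stand-ins for whatever your stack actually uses:

    # Mechanical checks run on the agent's branch before human review.
    # Tool choices and test paths here are assumptions, not a prescription.
    import subprocess
    import sys

    CHECKS = [
        ("type check",   ["mypy", "."]),
        ("lint",         ["ruff", "check", "."]),
        ("unit tests",   ["pytest", "-q", "tests/unit"]),
        ("integration",  ["pytest", "-q", "tests/integration"]),
    ]

    def main():
        failed = [name for name, cmd in CHECKS
                  if subprocess.run(cmd).returncode != 0]
        if failed:
            sys.exit(f"Rejected before human review: {', '.join(failed)} failed")
        # Everything mechanical is green; the reviewer's question is now
        # "does this do what I asked", not "did it break something".
        print("Checks passed - route the PR to a human for an intent review.")

    if __name__ == "__main__":
        main()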
The real hidden cost isn't the API calls or the review time - it's the observability gap. Most teams have no idea what their agents are actually doing across runs. No cost-per-task tracking, no quality metrics per model, no way to spot when an agent starts regressing. You end up flying blind and the compounding costs you mention are a symptom of that.
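Closing that gap doesn't take much: even appending one JSON line per agent run gets you cost-per-task and a regression signal. A sketch with made-up field names, assuming you can capture token counts and a CI pass/fail outcome for each task:

    # Per-run agent telemetry appended to a JSONL file. Field names and the
    # pricing/outcome sources are assumptions about your own setup.
    import json
    import time
    from dataclasses import dataclass, asdict

    @dataclass
    class AgentRun:
        task_id: str
        model: str
        prompt_tokens: int
        completion_tokens: int
        cost_usd: float         # computed from your provider's price sheet
        ci_passed: bool         # did the run's output clear CI?
        review_minutes: float   # human time spent on the resulting PR
        timestamp: float = 0.0

    def log_run(run: AgentRun, path: str = "agent_runs.jsonl") -> None:
        run.timestamp = time.time()
        with open(path, "a") as f:
            f.write(json.dumps(asdict(run)) + "\n")

    def cost_per_shipped_task(path: str = "agent_runs.jsonl") -> float:
        runs = [json.loads(line) for line in open(path)]
        shipped = [r for r in runs if r["ci_passed"]]
        # Spend on every attempt, divided by tasks that actually made it through.
        return sum(r["cost_usd"] for r in runs) / max(len(shipped), 1)

Once those lines exist, comparing cost_per_shipped_task and review_minutes per model over time is usually enough to notice an agent that has started regressing.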
> then you're stuck reading every line because it might've missed some edge case or broken something
This is what tests are for. Humans famously write crap code. They read it and assume they know what's going on, but actually they don't. Then they modify a line of code that looks like it should work, and it breaks 10 things. Tests are there to catch when it breaks so you can go back and fix it.
Agents are supposed to run tests as part of their coding loops, modifying the code until the tests pass. Of course reward hacking means the AI might modify the test to 'just pass' to get around this. So the tests need to be protected from the AI (in their own repo, a commit/merge filter, or whatever you want) and curated by humans.

Initial creation by the AI based on user stories, but test modifications go through a PR process and are scrutinized. You should have many kinds of tests (unit, integration, end-to-end, regression, etc), and you can have different levels of scrutiny (maybe the AI can modify unit tests on the fly, and in PRs you only look at the test modifications to ensure they're sane). You can also have a different agent with a different prompt do a pre-review to focus only on looking for reward hacks.
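One way to wire up that protection, as a sketch: a merge filter (pre-receive hook or CI step, whatever your forge supports) that blocks agent branches from touching the curated suites at all, while unit-test edits still show up in the normal PR diff. The protected paths here are placeholders for your own repo layout:

    # Sketch of a merge filter that keeps curated tests out of the agent's reach.
    # Protected paths are assumptions about the repo layout, not a standard.
    import subprocess
    import sys

    PROTECTED = ("tests/integration/", "tests/e2e/", "tests/regression/")
    # tests/unit/ is deliberately left out: the agent may modify those on the
    # fly, and reviewers scrutinize only those diffs in the PR.

    def changed_files(base="origin/main", head="HEAD"):
        out = subprocess.run(
            ["git", "diff", "--name-only", f"{base}...{head}"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.splitlines()

    def main():
        touched = [f for f in changed_files() if f.startswith(PROTECTED)]
        if touched:
            # Curated tests only change through their own human-reviewed flow,
            # never inside an agent's coding loop.
            sys.exit(f"Blocked: protected tests modified: {touched}")
        print("No protected tests touched; unit-test edits go through normal PR review.")

    if __name__ == "__main__":
        main()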