Really, this. You still need to check its work, but it is also pretty good at checking its work if told to look at specific things.
Make it stop and review its own work. Tell it to review whether the code is cohesive. Tell it to review it for security issues. Tell it to review it for the common problems you've seen in your own codebase.
Tell it to write a todo list for everything it finds, and tell it to fix each item.
And only review the code yourself once it has worked through a checklist of its own reviews.
We wouldn't waste time reviewing a first draft from another developer if they hadn't bothered to look it over and test it properly, so why would we do that for an AI agent, which is far cheaper?
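For what it's worth, that whole checklist can just live in the prompt. A minimal sketch in Python, where the checklist wording and the helper are my own illustration rather than any tool's API:

```python
# Illustrative only: the checklist text and helper name are assumptions,
# not taken from any particular agent or framework.
SELF_REVIEW_CHECKLIST = """\
Before you report back:
1. Review the code for cohesion: does it fit the existing design?
2. Review it for security issues.
3. Review it for problems we've hit before in this codebase.
4. Write a todo list of everything you find, then fix each item.
5. Only when the todo list is empty, summarise the change for review.
"""

def with_self_review(task: str) -> str:
    """Return the task prompt with the self-review checklist appended."""
    return task.rstrip() + "\n\n" + SELF_REVIEW_CHECKLIST

if __name__ == "__main__":
    print(with_self_review("Add rate limiting to the /login endpoint."))
```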
So why can't the deterministic part of the agent program embed all these checks?
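A rough sketch of what that embedding could look like, with a hypothetical `run_agent` callable standing in for the actual model call (names and prompts are illustrative only):

```python
# Sketch of a harness that deterministically forces review passes after each
# coding turn. run_agent is injected: a stand-in for whatever non-deterministic
# agent call you actually use (API client, CLI wrapper, etc.).
from typing import Callable, List

REVIEW_PASSES = [
    "Review the change for cohesion with the existing design and list issues.",
    "Review the change for security problems and list issues.",
    "Review the change for problems we've hit before in this codebase.",
    "Turn every issue listed so far into a todo list and fix each item.",
]

def run_with_embedded_reviews(task: str,
                              run_agent: Callable[[List[str]], str]) -> str:
    """Run the task, then walk the fixed review passes before surfacing output."""
    transcript: List[str] = [task]
    transcript.append(run_agent(transcript))        # initial attempt
    for review_prompt in REVIEW_PASSES:             # the deterministic part
        transcript.append(review_prompt)
        transcript.append(run_agent(transcript))
    return transcript[-1]                           # only this reaches the human
```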
Tell it to grade its work in various categories and tell it that you'll only accept B+ or better work. Getting it to focus on how well it's doing is an important distinction.
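If you want the harness to enforce that bar rather than just trusting the chat, one sketch is to ask for the grades in a parseable format and gate on them (the categories, prompt wording, and reply format here are all hypothetical, not any tool's API):

```python
# Sketch of a grading gate: ask the agent to grade its own work per category
# and refuse anything below B+. The "category: grade" reply format is an
# illustrative assumption.
GRADES = {"A+": 12, "A": 11, "A-": 10, "B+": 9, "B": 8, "B-": 7,
          "C+": 6, "C": 5, "C-": 4, "D+": 3, "D": 2, "F": 0}
THRESHOLD = GRADES["B+"]

GRADING_PROMPT = ("Grade your work with one 'category: letter grade' line each "
                  "for: correctness, readability, tests, security.")

def parse_grades(reply: str) -> dict:
    """Pull 'category: grade' lines out of the agent's reply."""
    grades = {}
    for line in reply.splitlines():
        if ":" in line:
            category, grade = (part.strip() for part in line.split(":", 1))
            if grade in GRADES:
                grades[category.lower()] = GRADES[grade]
    return grades

def meets_bar(reply: str) -> bool:
    """True only if every reported grade is B+ or better."""
    grades = parse_grades(reply)
    return bool(grades) and all(score >= THRESHOLD for score in grades.values())

# e.g. meets_bar("correctness: A\nreadability: B+\ntests: B\nsecurity: A-") -> False
```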
I wouldn't mind seeing a collection of objectives and the emitted output. My experience with LLM output is that it is very often over-engineered for no good reason, which makes it taxing for me to review.
I want to see this code written to some objective, to compare with what I would have written to the same objective. What I've seen so far are specs so detailed that very little is left to the discretion of the LLM.
What I want to see are cases where the LLM was asked for something and provided it, because I am curious to compare its answer to my own proposed solution.
(This sounds like a great idea for a site that shows users the user-submitted task, and only after they submit their attempt does it show them the LLM's attempt. Someone please vibe code this up, TIA)