I've been trying to get agentic coding to work, but the dissonance between what I'm seeing online and what I'm able to achieve is doing my head in.
Is there real evidence, beyond hype, that agentic coding produces net-positive results? If any of you have actually got it to work, could you share (in detail) how you did it?
By "getting it to work" I mean: * creating more value than technical debt, and * producing code that’s structurally sound enough for someone responsible for the architecture to sign off on.
Lately I’ve seen a push toward minimal or nonexistent code review, with the claim that we should move from “validating architecture” to “validating behavior.” In practice, this seems to mean: don’t look at the code; if tests and CI pass, ship it. I can’t see how this holds up long-term. My expectation is that you end up with "spaghetti" code that works on the happy path but accumulates subtle, hard-to-debug failures over time.
When I tried using Codex on my existing codebases, with or without guardrails, half of my time went into fixing the subtle mistakes it made or the duplication it introduced.
Last weekend I tried building an iOS app for pet feeding reminders from scratch. I instructed Codex to research and propose an architectural blueprint for SwiftUI first. Then, I worked with it to write a spec describing what should be implemented and how.
The first implementation pass was surprisingly good, although it had a number of bugs. Things went downhill fast, however. I spent the rest of my weekend getting Codex to make things work, fix bugs without introducing new ones, and research best practices instead of making stuff up. Although I made it record new guidelines and guardrails as I found them, things didn't improve. In the end I just gave up.
I personally can't accept shipping unreviewed code. It feels wrong. The product has to work, but the code must also be high-quality.
I've found it useful for getting features started and fixing bugs, but it depends on the feature. I use Claude Sonnet 4.5 and it usually does a pretty good job on well-known problems like setting up web sockets and drag and drop UIs which would take me much longer to do by hand. It also seems to follow examples well of existing patterns in my codebase like router/service/repository implementations. I've struggled to get it to work well for messy complicated problems like parsing text into structured objects that have thousands of edge cases and in which the complexity gets out of hand very quickly if not careful. In these cases I write almost all the code by hand. I also use it for writing ad-hoc scripts I need to run once and are not safety critical, in which case I use it's code as-is after a cursory review that it is correct. Sometimes I build features I would otherwise be too intimidated to try if doing by hand. I also use it to write tests, but I usually don't like it's style and tend to simplify them a lot. I'm sure my usage will change over time as I refine what works and what doesn't for me.
A principal engineer at Google posted on Twitter that Claude Code did in an hour what the team couldn’t do in a year.
Two days later, after people freaked out, context was added. The team built multiple versions in that year, each had its trade offs. All that context was given to the AI and it was able to produce a “toy” version. I can only assume it had similar trade offs.
https://xcancel.com/rakyll/status/2007659740126761033#m
My experience has been similar to yours, and I think a lot of the hype is from people like this Google engineer who play into the hype and leave out the context. This sets expectations way out of line from reality and leads to frustration and disappointment.
I use Augment with Claud Opus 4.5 every day at my job. I barely ever write code by hand anymore. I don't blindly accept the code that it writes, I iterate with it. We review code at my work. I have absolutely found a lot of benefit from my tools.
I've implemented several medium-scale projects that I anticipate would have taken 1-2 weeks manually, and took a day or so using agentic tools.
A few very concrete advantages I've found:
* I can spin up several agents in parallel and cycle between them. Reviewing the output of one while the others crank away.
* It's greatly improved my ability in languages I'm not expert in. For example, I wrote a Chrome extension which I've maintained for a decade or so. I'm quite weak in Javascript. I pointed Antigravity at it and gave it a very open-ended prompt (basically, "improve this extension") and in about five minutes in vastly improved the quality of the extension (better UI, performance, removed dependencies). The improvements may have been easy for someone expert in JS, but I'm not.
Here's the approach I follow that works pretty well:
1. Tell the agent your spec, as clearly as possible. Tell the agent to analyze the code and make a plan based on your spec. Tell the agent to not make any changes without consulting you.
2. Iterate on the plan with the agent until you think it's a good idea.
3. Have the agent implement your plan step by step. Tell the agent to pause and get your input between each step.
4. Between each step, look at what the agent did and tell it to make any corrections or modifications to the plan you notice. (I find that it helps to remind them what the overall plan is because sometimes they forget...).
5. Once the code is completed (or even between each step), I like to run a code-cleanup subagent that maintains the logic but improves style (factors out magic constants, helper functions, etc.)
This works quite well for me. Since these are text-based interfaces, I find that clarity of prose makes a big difference. Being very careful and explicit about the spec you provide to the agent is crucial.
You fundamentally misunderstand AI assisted coding if you think it does the work for you, or that it gets it right, or that it can be trusted to complete a job.
It is an assistant not a team mate.
If you think that getting it wrong, or bugs, or misunderstandings, or lost code, or misdirections, are AI "failing", then yes you will fail to understand or see the value.
The point is that a good AI assisted developer steers through these things and has the skill to make great software from the chaotic ingredients that AI brings to the table.
And this is why articles like this one "just don't get it", because they are expecting the AI to do their job for them and holding it to the standards of a team mate. It does not work that way.
The only approach I've tried that seems to work reasonably well, and consistently, was the following:
Make a commit.
Give Claude a task that's not particularly open ended, the closer to pure "monkey work" boilerplate nonsense the task is, the better (which is also the sort of code I don't want do deal with myself).
Preferably it should be something that only touches a file or two in the codebase unless it is a trivial refactor (like changing the same method call all over the place)
Make sure it is set to planning mode and let it come up with a plan.
Review the plan.
Let it implement the plan.
If it works, great, move on to review. I've seen it one-shot some pretty annoying tasks like porting code from one platform to another.
If there are obvious mistakes (program doesn't build, tests don't pass, etc.) then a few more iterations usually fix the issue.
If there are subtle mistakes, make a branch and have it try again. If it fails, then this is beyond what it can do, abort the branch and solve the issue myself.
Review and cleanup the code it wrote, it's usually a lot messier than it needs to be. This also allows me to take ownership of the code. I now know what it does and how it works.
I don't bother giving it guidelines or guardrails or anything of the sort, it can't follow them reliably. Even something as simple as "This project uses CMake, build it like this" was repeatedly ignored as it kept trying to invoke the makefile directly and in the wrong folder.
This doesn't save me all that much time since the review and cleanup can take long, but it serves a great unblocker.
I also use it as a rubber duck that can talk back and documentation source. It's pretty good for that.
This idea of having an army of agents all working together on the codebase is hilarious to me. Replace "agents" with "juniors I hired on fiverr with anterograde amnesia" and it's about how well it goes.
I have started to use it to write small throwaway things. Like write a standalone debug shader that can display all this state on top of this image in real time. Not in a million years would I had spent time to mess with fonts in a shading language or bring in immediate gui framework or such. Codex could oneshot that kind of thing and the blast radius is one file that is not part of the project. Or write a separate python program that implements this core logic and double check my thinking. I am not a professional programmer though.
I have the same experience despite using claude every day. As an funny anecdote:
Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong, fine, it happens, but worse the 30 or so tests they added added 10 minutes to the test run time and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests
Learning how to drive the models is a legit skill - and I don't mean "prompt engineering". There are absolutely techniques that help and because things are moving fast there is little established practice to draw from. But it's also been interesting seeing experienced coders struggle - I've found my time as a manager has been more help to me than my time as a coder. How to keep people on task and focused etc is very similar to managing humans. I suspect much of the next 5 years will be people rediscovering existing human and project management techniques and rebranding them as AI something.
Some techniques I've found useful recently:
- If the agent struggled on something once it's done I'll ask it "you were struggling here, think about what happened and if there are is anything you learned. Put this into a learnings document and reference it in agents.md so we don't get stuck next time"
- Plans are a must. Chat to the agent back and forth to build up a common understanding of the problem you want solved. Make sure to say "ask me any follow up questions you think are necessary". This chat is often the longest part of the project - don't skimp on it. You are building the requirements and if you've ever done any dev work you understand how important having good requirements are to the success of the work. Then ask the model to write up the plan into an implementation document with steps. Review this thoroughly. Then use a new agent to start work on it. "Implement steps 1-2 of this doc". Having the work broken down into steps helps to be able to do work more pieces (new context windows). This part is the more mindless part and where you get to catch up on reading HN :)
- The GitHub Copilot chat agent is great. I don't get the TUI folks at all. The Pro+ plan is a reasonable price and can do a lot with it (Sonnet, Codex, etc all available). Being able to see the diffs as it works is helpful (but not necessary) to catch problems earlier.
For me, the only metric that matters is wall-time between initial idea and when it's solid enough that you don't have to think about it.
Agentic coding is very similar to frameworks in this regard:
1. If the alignment is right, you have saved time.
2. If it's not right, it might take longer.
3. You won't have clear evidence of which of these cases applies until changing course becomes too expensive.
4. Except, in some cases, this doesn't apply and it's obvious
My colleague coded a feature with Code Claude in a day. The code looks good, also seemingly works. The code was reviewed and pushed out to production.
The problem: there is no way, he verified the code in any way. The business logic behind the feature would take probably few days to check for correctness. But if it looks good -> done. Let the customer check it. Of course, he claims “he reviewed it”.
It feels to me, we just skip doing half the things proper senior devs did, and claim we’re faster.
I used Claude Opus 4.5 inside Cursor to write RISC-V Vector/SIMD code. Specifically Depthwise Convolution and normal Convolution layers for a CNN.
I started out by letting it write a naive C version without intrinsic, and validated it against the PyTorch version.
Then I asked it (and two other models, Gemini 3.0 and GPT 5.1) to come up with some ideas on how to make it faster using SIMD vector instructions and write those down as markdown files.
Finally, I started the agent loop by giving Cursor those three markdown files, the naive C code and some more information on how to compile the code, and also an SSH command where it can upload the program and test it.
It then tested a few different variants, ran it on the target (RISC-V SBC, OrangePI RV2) to check if it improves runtime, and then continue from there. It did this 10 times, until it arrived at the final version.
The final code is very readable, and faster than any other library or compiler that I have found so far. I think the clear guardrails (output has to match exactly the reference output from PyTorch, performance must be better than before) makes this work very well.
I had a fairly big custom Python 2 static website generator ( github.com/csplib/csplib ), which I'd about given up transfering to Python 3, after a couple of aborted attempts. My main issue was that the libraries I was using didn't have Python 3 versions.
AN AI managed to do basically the whole transfer. One big help is I said "The website output of the current version should be identical", so I had an easy way to test for correctness (assuming it didn't try cheating by saving the website of course, but that's easy for me to check for)
My experience has been it does pretty well at writing a "rough draft" with sufficiently good instructions (in particular, telling it a general direction of how to implement it, rather than just telling it to what the end goal is). Then maybe do one or two passes at having the agent improve on that draft, then fix the rest by hand.
Hang in there. Yes it is possible; I do it every day. I also do iOS and my current setup is: Cursor + Claude Opus 4.5.
You still need to think about how you would solve the problem as an engineer and break down the task into a right-sized chunk of work. i.e. If 4 things need to change, start with the most fundamental change which has no other dependencies.
Also it is important to manage the context window. For a new task, start a new "chat" (new agent). Stay on topic. You'll be limited to about five back-and-forths before performance starts to suffer. (cursor shows a visual indicator of this in the for of the circle/wheel icon)
For larger tasks, tap the Plan button first, and guide it to the correct architecture you are looking for. Then hit build. Review what it did. If a section of code isn't high-quality, tell Claude how to change it. If it fails, then reject the change.
It's a tool that can make you 2 - 10x more productive if you learn to use it well.
Sure, here are my own examples:
* I came up with a list of 9 performance improvement ideas for an expensive pipeline. Most of these were really boring and tedious to implement (basically a lot of special cases) and I wasn't sure which would work, so I had Claude try them all. It made prototypes that had bad code quality but tested the core ideas. One approach cut the time down by 50%, I rewrote it with better code and it's saved about $6,000/month for my company.
* My wife and I had a really complicated spreadsheet for tracking how much we owed our babysitter – it was just complex enough to not really fit into a spreadsheet easily. I vibecoded a command line tool that's made it a lot easier.
* When AWS RDS costs spiked one month, I set Claude Code to investigate and it found the reason was a misconfigured backup setting
* I'll use Claude to throw together a bunch of visualizations for some data to help me investigate
* I'll often give Claude the type signature for a function, and ask it to write the function. It generally gets this about 85% right
Define "works"
Easiest way to get value is building tests. These don't ship.
You can get value from LLM as an additional layer of linting. Reviews don't ship either.
You can use LLM for planning. They can quickly scan across the codebases catch side effects of proposed changes or do gap analysis from the desired state.
Argumenting that agentic coding must be on or off seem very limiting.
When you first began learning how to program were you building and shipping apps the next day? No.
Agentic programming is a skill-set and a muscle you need to develop just like you did with coding in the past.
Things didn’t just suddenly go downhill after an arbitrary tipping point - what happened is you hit a knowledge gap in the tooling and gave up.
Reflect on what went wrong and use that knowledge next time you work with the agent.
For example, investing the time in building a strong test suite and testing strategy ahead of time which both you and the agent can rely on.
Being able to manage the agent and getting quality results on a large, complex codebase is a skill in itself, it won’t happen over night.
It takes practice and repetition with these tools to level-up, just like any thing else.
1. Start with a plan. Get AI to help you make it, and edit.
2. Part of the plan should be automated tests. AI can make these for you too, but you should spot check for reasonable behavior.
3. Use Claude 4.5 Opus
4. Use Git, get the AI to check in its work in meaningful chunks, on its own git branch.
5. Ask the AI to keep am append-only developer log as a markdown file, and to update it whenever its state significantly changes, or it makes a large discovery, or it is "surprised" by anything.
This is anecdotal and maybe reflects what other people are seeing.
If you know the field you want it to work in, then it can augment what you do very well.
Without that they all tend to create hot garbage that looks cool to a layperson.
I would also avoid getting it to write the whole thing up front. Creating a project plan and requirements can help ground them somewhat.
A loop I've found that works pretty well for bugs is this:
- Ask Claude to look at my current in-progress task (from Github/Jira/whatever) and repro the bug using the Chrome MCP.
- Ask it to fix it
- Review the code manually, usually it's pretty self-contained and easy to ensure it does what I want
- If I'm feeling cautious, ask it to run "manual" tests on related components (this is a huge time-saver!)
- Ask it to help me prepare the PR: This refers to instructions I put in CLAUDE.md so it gives me a branch name, commit message and PR description based on our internal processes.
- I do the commit operations, PR and stuff myself, often tweaking the messages / description.
- Clear context / start a new conversation for the next bug.
On a personal project where I'm less concerned about code quality, I'll often do the plan->implementation approach. Getting pretty in-depth about your requirements ovbiously leads to a much better plan. For fixing bugs it really helps to tell the model to check its assumptions, because that's often where it gets stuck and create new bugs while fixing others.
All in all, I think it's working for me. I'll tackle 2-3 day refactors in an afternoon. But obviously there's a learning curve and having the technical skills to know what you want will give you much better results.
Coding agent is a perfect simulation of a junior developer working under you. Developer that will tell you - “yes I can do that” about any language and any problem and will never ask you any questions trying very hard to appear competent.
Your job is to put them in constraints and give granular and clear tasks. Be aware that junior developer has very basic knowledge about architecture.
The good is that it does not simulate that part when developer tries shift blame or pin it on you. Because you’re to blame at all times.
Any real senior devs here using agentic coding?
I've been using agentic coding tools for the past year and a half, and the pattern I've observed is that they work best when they're treated as a very fast, very knowledgeable junior developer, not completely as "autonomous engineer".
When I try to give agents broad architectural tasks, they flounder. When I constrain them to small, well-defined units of work within an existing architecture, they can produce clean, correct code surprisingly often.
My experience is the same. In short, agents cannot plan ahead, or plan at a high level. This means they have a blindspot for design. Since they cannot design properly, it limits the kind of projects that are viable to smaller scopes (not sure exactly how small but in my experience, extremely small and simple). Anything that exceeds this abstract threshold has a good chance of being a net negative, with most of the code being unmantainable, unextensible, and unreliable.
Anyone who claims AI is great is not building a large or complex enough app, and when it works for their small project, they extrapolate to all possibilities. So because their example was generated from a prompt, it's incorrectly assumed that any prompt will also work. That doesn't necessarily follow.
The reality is that programming is widely underestimated. The perception is that it's just syntax on a text file, but it's really more like a giant abstract machine with moving parts. If you don't see the giant machine with moving parts, chances are you are not going to build good software. For AI to do this, it would require strong reasoning capabilities, that lets it derive logical structures, along with long term planning and simulation of this abstract machine. I predict that if AI can do this then it will be able to do every single other job, including physical jobs as it would be able to reason within a robotic body in the physical world.
To summarize, people are underestimating programming, using their simple projects to incorrectly extrapolate to any possible prompt, and missing the hard part of programming which involves building abstract machines that work on first principles and mathematical logic.
> When I tried using Codex on my existing codebases, with or without guardrails, half of my time went into fixing the subtle mistakes it made or the duplication it introduced.
If you want to get good at this, when it makes subtle mistakes or duplicates code or whatever, revert the changes and update your AGENTS.md or your prompt and try again. Do that until it gets it right. That will take longer than writing it yourself. It's time invested in learning how to use these and getting a good setup in your codebase for them.
If you can't get it to get it right, you may legitimately have something it sucks at. Although as you iterate might also have some other insights into why it keeps getting it wrong and can maybe change something more substantial about your setup to make it able to get it right.
For example I have a custom xml/css UI solution that draws inspiration both from XML and SwiftUI, and it does an OK job of making UIs for it. But sometimes it gets stuck in ways it wouldn't if it was using HTML or some known (and probably higher quality/less buggy) UI library. I noticed it keeps trying things, adding redundant markup to both the xml and css, using unsupported attributes that it thinks should exist (because they do in HTML/CSS), and never cleans up on the way.
Some amount of fixing up its context made it noticeably better at this but it still gets stuck and makes a mess when it does. So I made it write a linter and now it uses the linter constantly which keeps it closer to on the rails.
Your pet feeding app isn't in this category. You can get a substantial app pretty far these days without running into a brick wall. Hitting a wall that quickly just means you're early on the learning curve. You may have needed to give it more technical guidance from the start, and have it write tests for everything, make sure it makes the app observable to itself in some way so it can see bugs itself and fix them, stuff like that.
I've been having good results lately/finally with Opus 4.5 in Cursor. It still isn't one-shotting my entire task, but the 90% of the way it gets me is pretty close to what I wanted, which is better than in the past. I feel more confident in telling it to change things without it making it worse. I only use it at work so I can't share anything, but I can say I write less code by hand now that it's producing something acceptable.
For sysops stuff I have found it extremely useful, once it has MCP's into all relevant services, I use it as the first place I go to ask what is happening with something specific on the backend.
My gf manages to get paid using Cursor/Copilot, despite not being able to branch herself out of a loop
In my experience Copilots work expertly at CRUD'ing inside a well structured project, and for MVPs in languages you aren't an expert on (Rust, C/C++ in my case)
The biggest demerit is that agents are increasingly trying to be "smart" and using powershell search/replace or writing scripts to skimp on tokens, with results that make me unreasonably angry
I tried adding i18n to an old react project, and copilot used all my credits + 10 USD because it kept shitting everything up with its maddening, idiotic use if search replace
If it had simply ingested each file and modified them only once, it would have been cheaper
As you can tell, I am still salty about it
> The product has to work, but the code must also be high-quality.
I understand and admire your commitment to code quality. I share similar ideals.
But it's 2026 and you're asking for evidence that agentic coding works. You're already behind. I don't think you're going to make it. Your competitors are going to outship you.
In most cases, your customers don't care about your code. They only want something that works right.
Track Mitsuhiko's work and blog posts:
https://news.ycombinator.com/user?id=the_mitsuhiko
https://lucumr.pocoo.org/about/
He has an extensive and impressive body of work in Python and Rust pre LLMs. He's now working on his own startup and doing much of it with AI and documenting his journey. I trust his opinions even though I don't use LLMs as much as he does.
As far as I can tell, there are exactly 3 use cases that have demonstrably worked with AI, in the sense that their stakeholders (not the AI companies, the users) swear it works.
1. training a RAG on support questions for chat or documentation, w/good material
2. people doing GTM work in marketing, for things like email automation
3. people using a combination of expensive tools - Claude + Cursor + something else (maybe n8n, maybe a custom coding service) - to make greenfield apps
Yes. Can I share it? No, sadly. It definitely works - but I think sometimes expectations are too high is all.
For me its a major change for personal projects. That said, since about 3 months ago VS Code Github Copilot is remarkably stable in working with existing code base and I could implement changes to those projects that would have taken me a substantially longer time. So at least for this use-case its there. Hidden game changers are Gradio/Streamlit for easy UI.
I still think it's useful, but you have to make a heavy use of the 'plan' mode. I still ask the new hires to avoid doing more than just the plan (or at most generating tests cases), so they can understand the codebase before generating new code inside.
Basically my point of view is that if you don't feel comfortable reviewing your coworkers code, you shouldn't generate code with AI, because you will review it badly and then I will have to catch the bugs and fix it (happened 24 hours ago). If you generate code, you better understand where it can generate side effects.
Yes. Over the last month, I've made heavy use of agentic coding (a bit of Junie and Amp, but mostly Antigravity) to ship https://www.ratatui-ruby.dev from scratch. Not just the website... the entire thing.
The main library (rubygem) has 3,662 code lines and 9,199 comment lines of production Ruby and 4,933 code lines and 710 comment lines of Rust. There are a further 6,986 code lines and 2,304 comment lines of example applications code using the library as documentation, and 4,031 lines of markdown documentation. Plus, 15,271 code lines and 2,159 comment lines of automated tests. Oh, and 4,250 lines in bin/ and tasks/ but those are lower-quality "internal" automation scripts and apps.
The library is good enough that Sidekiq is using it to build their TUI. https://github.com/sidekiq/sidekiq/issues/6898
But that's not all I've built over this timeframe. I'm also a significant chunk of the way through an MVU framework, https://rooibos.run, built on top of it. That codebase is 1,163 code lines and 1,420 comment lines of production Ruby, 4,749 code lines and 521 comment lines of automated tests. I need to add to the 821 code lines 221 comment lines of example application code using the framework as documentation, and to the 2,326 lines of markdown documentation.
It's been going so well that the plan is to build out an ecosystem: the core library, an OOP and an FP library, and a set of UI widgets. There are 6,192 lines of markdown in the Wik about it: mailing list archives, AI chat archives, current design & architecture, etc.
For context, I am a long-time hobbyist Rubyist but I cannot write Rust. I have very little idea of the quality of the Rust code beyond what static analyzers and my test suite can tell me.
It's all been done very much in public. You can see every commit going back to December 22 in the git repos linked from the "Sources" tab here: https://sr.ht/~kerrick/ratatui_ruby/ If you look at the timestamps you'll even notice the wild difference between my Christmas vacation days, and when I went back to work and progress slowed. You can also see when I slowed down to work on distractions like https://git.sr.ht/~kerrick/ramforge/tree and https://git.sr.ht/~kerrick/semantic_syntax/tree.
If it keeps going as well as it has, I may be able to rival Charm's BubbleTea and Bubbles by summertime. I'm doing this to give Rubyists the opportunity to participate in the TUI renaissance... but my ultimate goal is to give folks who want to make a TUI a reason to learn Ruby instead of Go or Rust.
I have written a software which I wanted to do so for past 7-8 years within past 3 months. I have over 6000 pages of conversations between me and chatgpt Claude Gemini. And hoping to get patent soon. It consists of over to 260k loc, works well it is architected to support many different industries without much changes but configuration and has very good headed and headless qa coverage. I have spent about 16-18 hours a day because I am so bought into the idea and the outcome I am getting. My patent lawyer suggested getting provisional patent on my work. So for me it works
Depending on the risk profile of the project, it absolutely works with amazing productivity gains. And the agent of today is the worst agent if will ever be because tomorrow its going to be even better. I am finding amazing results with the ideate -> explore -> plan -> code -> test loop.
Yes, agentic coding works and has massive value. No, you can't just deploy code unreviewed.
Still takes much less time for me to review the plan and output than write the code myself.
I have had similar questions, and am still evaluating here. However, I've been increasingly frustrated with the sheer volume of anecdotal evidence from yay and naysayers of LLM-assisted coding. I have personally felt increased productivity at times with it, and frustrations at others.
In order to better research, I built (ironically, mostly vibe coded) a tool to run structured "self-experiments" on my own usage of AI. The idea is I've init a bunch of hypotheses I have around my own productivity/fulfillment/results with AI-assisted coding. The tool lets me establish those then run "blocks" where I test a particular strategy for a time period (default 2 weeks). So for example, I might have a "no AI" block followed by a "some AI" block followed by a "full agent all-in AI block".
The tool is there to make doing check-ins easier, basically a tiny CLI wrapper around journaling that stays out of my way. It also does some static analysis on commit frequency, code produced, etc. but I haven't fleshed out that part of it much and have been doing manual analysis at the end of blocks.
For me this kind of self-tracking has been more helpful than hearsay, since I can directly point to periods where it was working well and try to figure out why or what I was working on. It's not fool-proof, obviously, but for me the intentionality has helped me get clearer answers.
Whether those results translate beyond a single engineer isn't a question I'm interested in answering and feels like a variant of developer metrics-black-hole, but maybe we'll get more rigorous experiments in time.
The tool open source here (may be bugs, only been using it a few weeks): https://github.com/wellwright-labs/devex
I've built multiple new apps with it and manage two projects that I wrote. I barely write any code other than frontend, copy, etc.
One is a VSCode extension and has thousands of downloads across different flavors of the IDE -- won't plug it here to spare the downvotes ;)
Been a developer professionally for nearly 20 years. It is 100% replacing most of the things I used to code.
I spend most of my time while it's working testing what it's built to decide on what's next. I also spend way more time on DX of my own setup, improving orchestration, figuring out best practice guidance for the Agent(s), and building reusable tools for my Agents (MCP).
Yep, it works. Like anything getting the most out of these tools is its own (human) skill.
With that in mind, a couple of comments - think of the coding agents as personalities with blind spots. A code review by all of them and a synthesis step is a good idea. In fact currently popular is the “rule of 5” which suggests you need the LLM to review five times, and to vary the level of review, e.g. bugs, architecture, structure, etc. Anecdotally, I find this is extremely effective.
Right now, Claude is in my opinion the best coding agent out there. With Claude code, the best harnesses are starting to automate the review / PR process a bit, but the hand holding around bugs is real.
I also really like Yegge’s beads for LLMs keeping state and track of what they’re doing — upshot, I suggest you install beads, load Claude, run ‘!bd prime’ and say “Give me a full, thorough code review for all sorts of bugs, architecture, incorrect tests, specification, usability, code bugs, plus anything else you see, and write out beads based on your findings.” Then you could have Claude (or codex) work through them. But you’ll probably find a fresh eye will save time, e.g. give Claude a try for a day.
Your ‘duplicated code’ complaint is likely an artifact of how codex interacts with your codebase - codex in particular likes to load smaller chunks of code in to do work, and sometimes it can get too little context. You can always just cat the relevant files right into the context, which can be helpful.
Finally, iOS is a tough target — I’d expect a few more bumps. The vast bulk of iOS apps are not up on GitHub, so there’s less facility in the coding models.
And any front end work doesn’t really have good native visual harnesses set up, (although Claude has the Claude chrome extension for web UIs). So there’s going to be more back and forth.
Anyway - if you’re a career engineer, I’d tell you - learn this stuff. It’s going to be how you work in very short order. If you’re a hobbyist, have a good time and do whatever you want.
try other harnesses than codex.
ive had more success with review tools, rather than the agent getting the code quality right the first time.
current workflow
1. specs/requirements/design, outputting tasks 2. implementation, outputting code and tests 3. run review scripts/debug loops, outputting tasks 4. implement tasks 5. go back to 3
the quality of specs, tasks, and review scripts make a big difference
one of the biggest things that gets the results better is if you can get a feedback loop in from what the app actually does back to the agent. good logs, being able to interact/take screenshots a la playwright etc
guidelines and guardrails are best if theyre tools that the agent runs, or that run automatically to give feedback.
Yes.
Caveat: can't be pure vibes. Needs ownership, care, review and willingness to git reset and try again when needed. Needs a lot of tests.
Cavaet: Greenfield.
Since we are on this topic, how would I make an agent that does this job:
I am writing an automation software that interfaces with a legacy windows CAD program. Depending on the automation, I just need a picture of the part. Sometimes I need part thickness. Sometimes I need to delete parts. Etc... Its very much interacting with the CAD system and checking the CAD file or output for desired results.
I was considering something that would take screenshots and send it back for checks. Not sure what platforms can do this. I am stumped how Visual Studio works with this, there are a bunch of pieces like servers, agents, etc...
Even a how-to link would work for me. I imagine this would be extremely custom.
The way I see it, is that for non-trivial things you have to build your method piece by piece. Then things start to improve. It's a process of... developing a process.
Write a good AGENTS.md (or CLAUDE.md) and you'll see that code is more idiomatic. Ask it to keep a changelog. Have the LLM write a plan before starting code. Ask it to ask you questions. Write abstraction layers it (along with the fellow humans of course) can use without messing with the low-level detail every time.
In a way you have to develop a framework to guide the LLM behavior. It takes time.
My main rule is never to commit code you don’t understand because it’ll get away from you.
I employ a few tricks:
1- I avoid auto-complete and always try to read what it does before committing. When it is doing something I don’t want, I course correct before it continues
2- I ask the LLM questions about the changes it is making and why. I even ask it to make me HTML schema diagrams of the changes.
3- I use my existing expertise. So I am an expert Swift developer, and I use my Swift knowledge to articulate the style of what I want to see in TypeScript, a language I have never worked in professionally.
4- I add the right testing and build infrastructure to put guardrails on its work.
5- I have an extensive library of good code for it to follow.
If you're building something new, stick with languages/problems/projects that have plenty of analogues in the opensource world and keep your context windows small, with small changes.
One-shotting an application that is very bespoke and niche is not going to go well, and same goes for working on an existing codebase without a pile of background work on helping the model understand it piece by piece, and then restricting it to small changes in well-defined areas.
It's like teaching an intern.
When you have a hammer, everything looks like a nail. Ad nauseam.
AI has made it possible for me to build several one-off personal tools in the matter of a couple of hours and has improved my non-tech life as a result. Before, I wouldn't even have considered such small projects because of the effort needed. It's been relieving not to have to even look at code, assuming you can describe your needs in a good prompt. On the other hand, I've seen vibe coded codebases with excessive layers of abstraction and performance issues that came from a possibly lax engineering culture of not doing enough design work upfront before jumping into implementation. It's a classic mistake, that is amplified by AI.
Yes, average code itself has become cheap, but good code still costs, and amazing code, well, you might still have an edge there for now, but eventually, accept that you will have to move up the abstraction stack to remain valuable when pitted against an AI.
What does this mean? Focus on core software engineering principles, design patterns, and understanding what computer is doing at a low level. Just because you're writing TypeScript doesn't mean you shouldn't know what's happening at the CPU level.
I predict the rise in AI slop cleanup consultancies, but they'll be competing with smarter AIs who will clean up after themselves.
I review it as i generate it. for quality. i guide it to be self-testing. create unit tests and integration tests according to my standards
I think one fatal flaw is letting the agent build the app from scratch. I've had huge success with agents, but only on existing apps that were architected by humans and have established conventions and guardrails. Agents are really bad at architecture, but quite good at following suit.
Other things that seem to contribute to success with agents are:
- Static type systems (not tacked-on like Typescript)
- A test suite where the tests cover large swaths of code (i.e. not just unit testing individual functions; you want e2e-style tests, but not the flaky browser kind)
With all the above boxes ticked, I can get away with only doing "sampled" reviews. I.e. I don't review every single change, but I do review some of them. And if I find anything weird that I had missed from a previous change, I to tell it to fix it and give the fix a full review. For architectural changes, I plan the change myself, start working on it, then tell the agent to finish.