For what it's worth, writing good PRs applies to more cases than just AI-generated contributions. In my PR descriptions, I usually start by describing how things currently work, then give a summary of what needs to change and why. Then I go on to describe what exactly is changing with the PR. This high-level summary serves to educate the reviewer, and acts as a historical record in the git log for the benefit of those who come after you.
From there, I include explicit steps for how to test, including manual testing, and unit test/E2E test commands. If it's something visual, I try to include at least a screenshot, or sometimes even a brief screen capture demonstrating the feature.
Really go out of your way to make the reviewer's life easier. One benefit of doing all of this is that in most cases, the reviewer won't need to reach out to ask simple questions. This also helps to enable more asynchronous workflows, or distributed teams in different time zones.
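A minimal sketch of that structure, roughly as it might live in a pull request template (the section names and commands below are placeholders, not a prescribed format):

    ## Context
    How things currently work, and why that needs to change.

    ## What this PR changes
    High-level summary of the change, for the reviewer and for the git log.

    ## How to test
    1. Manual steps: start the app, do X, expect Y.
    2. Automated: your project's unit test and E2E commands (e.g. `npm test`, `npm run e2e`).

    ## Screenshots / recordings
    Before/after screenshots or a short screen capture for anything visual.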
We should get back to the basic definition of the engineering job. An engineer understands requirements, translates them into logical flows that can be automated, communicates tradeoffs across the organization, and makes tradeoff calls on maintainability, extensibility, readability, and security. Most importantly, they’re accountable for the outcome, because many tradeoffs only reveal their cost once they hit reality.
None of this is covered by code generation, nor by juniors submitting random PRs. Those are symptoms of juniors (and not only juniors) missing the fundamentals. When we forget what the job actually is, we create misalignment with junior engineers and end up with weird ideas like "spec-driven development".
If anything, coding agents are a wake-up call that clarifies what the engineering profession is really about.
I’d go further and say while testing is necessary, it is not sufficient. You have to understand the code and convince yourself that it is logically correct under all relevant circumstances, by reasoning over the code.
Testing only “proves” correctness for the specific state, environment, configuration, and inputs the code was tested with. In practice that covers only a tiny portion of possible circumstances, and omits all kinds of edge and non-edge cases.
Posted down thread, but worth posting as a comment too.
I know Simon follows this "Issue First" style of work in his projects, with a strong requirement for passing tests to be included.
It's been a best practice for a long time. I really enjoyed this when I read it ~10 years ago, and it still stands the test of time:
https://rfc.zeromq.org/spec/42/#24-development-process
The rationale was articulated clearly in:
https://hintjens.gitbooks.io/social-architecture/content/cha...
If you have time, do yourself a favour and read the whole lot. And then liberally copy parts of C4 into your own process. I have advocated for many components of it, in many contexts, at $employer, and will continue to do so.
there’s one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers—or open source maintainers—and expects the “code review” process to handle the rest.
Is anyone else seeing this in their orgs? I'm not...
I don't test my code because I think it's my duty. I test it because my personal motivation is to see it working! What's the point of writing code if I don't even get to see it run?!
If someone's not even interested and excited to see their code work, they are in the wrong profession.
> Your job is to deliver code you have proven to work.
Strong disagree here: your job is to deliver solutions that help the business solve a problem. In _most_ cases that means delivering code that you can confidently prove satisfies the requirements, like the OP mentioned, but I think this is an important, if nitpicky, distinction that I didn't understand until later in my career.
I agree with this, except it glosses over security. Your job is to deliver SECURE code that you have proven to work.
Manual and automatic testing are still both required, but you must explicitly ensure that security considerations are included in those tests.
The LLM doesn't care. Caring is YOUR job.
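A hedged illustration of folding those security considerations into the same automated suite as the functional checks. The tiny Flask app, route, and token below are made-up placeholders (not a recommendation for how to do auth); the point is that "unauthenticated access is rejected" gets asserted on every run instead of being left to manual review:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/reports")
    def reports():
        # Placeholder check; a real service would verify a signed token.
        if request.headers.get("Authorization") != "Bearer secret-token":
            return jsonify(error="unauthorized"), 401
        return jsonify(reports=["q1", "q2"])

    def test_reports_rejects_unauthenticated_requests():
        response = app.test_client().get("/reports")
        assert response.status_code == 401
        # Failure responses should not leak internals such as stack traces.
        assert "Traceback" not in response.get_data(as_text=True)

    def test_reports_accepts_a_valid_token():
        response = app.test_client().get(
            "/reports", headers={"Authorization": "Bearer secret-token"}
        )
        assert response.status_code == 200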
I think the problem is in what “proven” means. People who don’t know any better will just do all of that with LLMs too, and still deliver the giant untested PRs, just with some LLM-written tests attached.
I vibe code a lot of stuff for myself, mostly for viewing data, when I don’t really need to care how it works. I’m coming around to the idea that outside of some specific circumstances where everyone has agreed they don’t need to care about or understand the code, team vibe coding is a bad practice.
If I’m paying an engineer, it’s for their work, unless explicitly agreed otherwise.
I think vibe coding is soon going to be seen the same way as “research” where you engage an offshore team (common e.g. in consulting) to give you a rundown on some topic and get back the first five Google search results. Everyone knows how to do that; if that's what they wanted, they wouldn't be hiring someone to do it.
The job, in the modern world, is to close tickets. The code quality is negotiable, because the entire automated software process doesn't measure code quality, just statistics.
That's why I refuse to take part in it. But I'm an old-world craftsman by now, and I understand nobody wants to pay for working, well-thought-out code any more. They don't want a Chesterfield; they want plywood and glue.
The first problem is turning engineers into accountability sinks. This was a problem before LLMs too, but it is now a much bigger, structural problem thanks to the democratization of the capacity to produce plausible-looking dumb code. You will be forced to underwrite more and more of it, and be expected to absorb the downsides.
The root cause is the second problem: short of formal verification, you can never exhaustively prove that your code works. You can demonstrate it for a sensible subset of inputs and states, automate that demonstration, and hope the state of the world approximately stays that way (spoiler: it won't). This is why 100% test coverage is, in most cases, a bad goal. This is why "sensible" is the key operative attitude, which LLMs suck at right now.
The root cause of that one is the third problem: your job is to solve a business problem. If your code is not helping with the business problem, it is not working in the literal sense of the word. It is an artifact that does a thing, but it is not doing work. And since you're downstream of all the self-contradicting, ever-changing requirements in a biased framing of a chaotic world, you can never fully prove or demonstrate that your code solves a business problem, and that is the end state.
In my last job (engineering manager for a Japanese high-quality hardware manufacturer), we were expected to deliver software that works.
In fact, if any bugs were found by the official "last step" QA Department, we (as a software development department) were dinged. If QA found bugs, they could stop the entire product release, so you did not want to be responsible for that.
This resulted in each software development department setting up their own, internal "QC team" of testers. If they found bugs, then individual programmers (or teams) would get dinged, but the main department would not.
Our software got a lot of testing.
There’s an anecdote from one of Dijkstra’s essays that strikes at the heart of this phenomenon. I’ll paraphrase because I can’t remember the exact EWD number off the top of my head.
A colleague was working on an important subsystem and would ask Dijkstra for a review when he thought it was ready. Dijkstra would have to stop what he was doing, analyze the code, and would find a grievous error or edge case. He would point it out to the colleague, who would then get back to work. The colleague would submit his code for review again, and this could carry on enough times that Dijkstra got annoyed.
Dijkstra proposed a solution: his colleague would have to submit, along with his code, some form of proof or argument as to why it was correct and ready to merge. That way Dijkstra could save time by only having to review the argument and not all of the code.
There’s a way of looking at LLM output as Dijkstra’s colleague. It puts a lot of burden on the human using the tool to review all of the code. I like Doctorow’s mental model of a reverse centaur. The LLM cannot reason and so won’t provide you with a sound argument. It can probably tell you what it did and summarize the code changes it made… but it can’t decide to merge those changes. It needs a human, the bottom half of the centaur, to do the last bit of work. Because that’s all we’re doing when we let these tools do most of the work for us: we’re here to take the blame.
And all it takes is an implementation of what we’re trying to build already, every open source library ever, all of SO, a GW of power from a methane power plant, an Olympic pool of water and all of your time reviewing the code it generates.
At the end of the day it’s on you to prove why your changes and contributions should be merged. That’s a lot of work! But there are no shortcuts. Luckily you can still reason, something the LLMs struggle with, so use that advantage while you can when you choose to use such tools.
> the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers—or open source maintainers—and expects the “code review” process to handle the rest.
I'm noticing something very similar, and not necessarily from junior roles: long messages produced with these AI writing assistants that summarize things, create follow-ups, etc., putting the additional burden on whoever needs to read them. It makes me think of the quote: "I would have written a shorter letter, but I didn't have the time."
Actually it's more specific than that. A company pays you not just to "write code", not just to "write code that works", but to write code that works in the real world. Not on your laptop. Not in CI tests. Not on some staging environment. But in the real world. It may work fine in a theoretical environment, but deflate like a popped balloon in production. This code has no value to the business; they don't pay you to ship popped balloons.
Therefore you must verify it works as intended in the real world. This means not shipping code and hoping for the best, but checking that it actually does the right thing in production. And on top of that, you have to verify that it hasn't caused a regression in something else in production.
You could try to do that with tests, but tests aren't always feasible. Therefore it's important to design fail-safes into your code that ALERT YOU to unexpected or erroneous conditions. It needs to do more than just log an error to some logging system you never check - you must actually be notified of it, and you should consider it a flaw in your work, like a defective pair of Nikes on an assembly line. Some kind of plumbing must exist to take these error logs (or metrics, traces, whatever) and send it to you. Otherwise you end up producing a defective product, but never know it, because there's nothing in place to tell you its flaws.
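A minimal sketch of that plumbing, assuming a generic webhook-style alerting channel (the URL is a placeholder for whatever your team actually watches: Slack, PagerDuty, email):

    # Sketch: anything logged at ERROR or above gets pushed to a channel a
    # human actually watches, instead of rotting in a log file nobody reads.
    import json
    import logging
    import urllib.request

    ALERT_WEBHOOK = "https://alerts.example.com/hook"  # placeholder URL

    class AlertHandler(logging.Handler):
        def emit(self, record):
            payload = json.dumps({"text": self.format(record)}).encode()
            req = urllib.request.Request(
                ALERT_WEBHOOK,
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            try:
                urllib.request.urlopen(req, timeout=5)
            except OSError:
                # Never let the alerting path take down the application itself.
                pass

    logger = logging.getLogger("myapp")
    logger.addHandler(AlertHandler(level=logging.ERROR))
    logger.error("payment webhook returned 500")  # this now pings a human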
Every single day I run into somebody's broken webapp or mobile app. Not only do the authors have no idea (either because they aren't notified of the errors, or don't care about them), there is no way for me to even e-mail the devs to tell them. I try to go through customer support, a chat agent, anything, and even they don't have a way to send in bug reports. They've insulated themselves from the knowledge of their own failures.
The way I would phrase it is: software engineering is the craft of delivering the right code at the right time, where "right code" means it can be trusted to do the "right thing."
A bit clunky, but I think that can be scaled from individual lines of code to features or entire systems, whatever you are responsible for delivering, and it encompasses all the processes that go into figuring out what code actually needs to be written and making sure it does what it's supposed to.
Trust and accountability are absolutely a critical aspect of software engineering and the code we deliver. Somehow that is missed in all the discussions around AI-based coding.
The whole phenomenon of AI "workslop" is not a problem with AI, it's a problem with lack of accountability. Ironically, blaming workslop on AI rather than organizational dysfunction is yet another instance of shirking accountability!
I think there are two other things missing: security and maintainability. Working code that can never be touched again by a developer, or that requires an excessive amount of time to maintain, is not part of a developer's job either.
Overall, this hits the nail on the head about not delivering broken code and providing automated tests. Thanks for putting your thoughts on paper.
Having the coding agent make screenshots is a big power up.
I’m experimenting with how to get these into a PR, and the “gh” CLI tool is helpful.
Does anyone have a recipe to get a coding agent to record video of webflows?
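One possible recipe, assuming Playwright for Python: a browser context created with record_video_dir saves a .webm per page, and the same session can grab the screenshots mentioned above. The URL and paths below are placeholders for your own flow:

    # Requires: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(record_video_dir="videos/")
        page = context.new_page()
        page.goto("https://example.com/signup")        # placeholder flow
        page.screenshot(path="screenshots/signup.png")  # still image for the PR
        context.close()   # closing the context finalizes the .webm recording
        print("video saved to", page.video.path())
        browser.close()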
I want to distill this post into some sort of liquid I can inject directly into my dev teams. It's absolutely spot on. Seeing a PR with a change that doesn't build is one of the most disappointing things.
When I start working on a ticket I actually start writing up a PR early on.
As I figure out my manual testing, I'll write out the steps that I took in my PR.
I've found that writing it out as I go does two things: 1) It makes it easier to have a detailed PR and 2) it acts as a form of rubber-ducking. As I'm updating my PR I'll realize steps I've missed in my testing.
Something that also helped out with my manual testing skill was working in a place that had ZERO automated testing. Every PR required a detailed testing plan that you did and that your reviewer could re-create.
I’d go further: it’s not enough to be able to prove that your code works. It’s required that you also understand why it works.
Otherwise you’ll end up in situations where it passes all test cases yet fails for something unexpected in the real world, and you don’t know why, because you don’t even know what’s going on under the hood.
There is a heavy emphasis on testing the code as the way to provide guarantees that it works. While this is a helpful tool, I often find that the best engineers are ones who take a more first principles approach to the code and can reason about why the solution is comprehensive (covers all edge cases) and clean (easy for humans and LLMs to build on).
It often takes discipline to think and completely map out solutions before you build. This is where experience and knowing common patterns can also help.
When you have the experience of having manually written or read a lot of code, it helps you at the very least quickly understand what the LLMs are writing and reason about it, even if only after the fact rather than up front.
Manual testing as the first step… not very productive imo.
Outside-in testing is great, but I typically do automated outside-in testing and only test manually at the end. The testing loop needs to be repeatable and fast; manual is too slow.
Who is "you"?
It's not my job, really. And given the state of IT these days, it's barely anyone's, it seems.
There is a whole range of “proven to work”: with testing you cannot prove that there are no bugs.
Your job is to deliver code that is up to specification.
Not even checking the happy path is, of course, gross negligence. But so is spending too much time on edge cases that no one will run into, or that the person asking doesn’t want to pay to cover.
> As software engineers we…
That’s the thing. People exhibiting such rude behavior usually are not, or haven’t been in a looong time…
As for the local testing part not being performed, this is a slippery slope I’m fighting everyday: more and more cloud based services and platforms are used to deploy software to run with specific shenanigans and running it locally requires some kind of deep craft and understanding. Vendor lock-in is coming back in style (e.g. Databricks)
"Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That’s no longer valuable. What’s valuable is contributing code that is proven to work."
I'd go further: what's valuable is code review. So review the AI agent's code yourself first, ensuring not only that it's proven to work, but also that it's good quality (across various dimensions but most importantly in maintainability in future). If you're already overwhelmed by that thousand-line patch, try to create a hundred-line patch that accomplishes the same task.
I expect code review tools to also rapidly change, as lines of code written per person dramatically increase. Any good new tools already?
Isn't this in contradiction with your blog post from yesterday, though? It's impossible to prove that a complex project made in 4.5 hours works. It might have passed 9,000 tests, but surely there are always going to be edge cases. I personally wouldn't be comfortable claiming I've proved it works and saying the job is done, even if the LLM did the whole thing and all existing tests passed, until I'd played with it for several months. And even then I would assume I'd need to rely on bug reports coming in, because it's running on lots of different systems. I honestly don't know if software is ever really finished.
My takeaway from your blog post yesterday was that with a robust enough testing system the LLM can do the entire thing while I do Christmas with the family.
(Before all the AI fans come in here. I'm not criticizing AI.)
I don't think this quite captures the problem: even if the code is functional and proven to work, it can still be bad in many other ways.
The submitter should understand how it works and be able to 'own' and review modifications to it. That's cognitive work submitters ipso facto don't do by offloading the understanding to an LLM. That's the actual hard work reviewers and future programmers have to do instead.
Since they’re robots, automated tests and manual tests are effectively the same thing.
I'd buttress this statement with a nuance. Automated tests typically run in their entirety, usually by a well-known command like cargo test or at least by the CI tools. Manual tests are often skipped because the test seems to be far away from the code being changed.
My all-time favorite team had a rule that your code didn't exist if it didn't have automated tests to "defend" it. If it didn't, it was OK, or at least not surprising, for someone else to break or refactor it out of existence (not maliciously, of course).
I agree with the author overall. Manual testing is what I call "vibe testing" and I think by itself is insufficient, no matter if you or the agent wrote the code. If you build your tests well, using the coding agent becomes smooth and efficient, and the agent is safe to do longer stretches of work. If you don't do testing, the whole thing is just a bomb ticking in your face.
My approach to coding agents is to prepare a spec at the start, as complete as possible, and develop a beefy battery of tests as we make progress. Yesterday there was a story, "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours". They had 9000+ tests. That was the secret sauce.
So the future of AI coding as I see it: it will be better than pre-2020, we will learn to spec and plan good tests, and the tests become our contract that the code does what it is supposed to do. You can throw away the code, keep the specs and tests, and regenerate at any time.
Well a 1000 line PR is still not welcome. It puts too much of a burden on the maintainers. Small PRs are the way to go, tests are great too. If you have to submit a big PR, get buy in from a maintainer first that they will review your code.
Not only to work, but also not to make life hell for the coders who come after you.
Prove is a strong word. There are few cases in real-world programming where you can prove anything.
I prefer to make this probabilistic: use testing to reduce the probability that your code isn't correct, for the situations in which it is expected to be deployed. In this sense, coding and testing is much like doing experimental physics: we never really prove a theory or disprove it, we just invalidate clearly wrong ones.
I agree that tests and automation are probably the best tools we have to validate our code, and the author of the PR should be more responsible. However, tests can't prove that the code works. It's almost the opposite: if they pass and coverage is good, the code has a better chance of working; if they fail, they prove the code doesn't work.
A lot of AI coding changes coding to more of a declarative practice.
Claude, etc., work best with good tests that verify the system works. And so, in some ways, the tests become the real code rather than the code that does the thing. If you're responsible for the thing, then 90% of your responsibility moves to verifying behavior and giving agents feedback.
> The first is manual testing. If you haven’t seen the code do the right thing yourself, that code doesn’t work. If it does turn out to work, that’s honestly just pure chance.
Depending on exactly what the author meant here, I disagree. Our first and default tool should be some form of lightweight automated testing. It's explicit (it serves as a form of spec and documents how to use the software), it's repeatable (manual testing is done once and its result is invalidated moments later), and its cost per minute of effort is more or less the same (most companies have the engineers do the tests, and engineers are expensive).
Yes. There will be exceptions and exceptional cases. This author is not talking about exceptions and neither am I. They're not an interesting addition to this conversation.
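A hedged illustration of the "explicit and repeatable" point: even a doctest on a small helper doubles as usage documentation and a regression check that reruns for free, where the equivalent manual check evaporates as soon as you move on. The function here is a made-up stand-in:

    import re

    def slugify(title: str) -> str:
        """Turn a title into a URL slug.

        The examples below are documentation *and* a repeatable test: run
        `python -m doctest this_module.py` (or let pytest collect them
        with --doctest-modules).

        >>> slugify("Hello, World!")
        'hello-world'
        >>> slugify("  spaces   everywhere ")
        'spaces-everywhere'
        """
        words = re.findall(r"[a-z0-9]+", title.lower())
        return "-".join(words)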
No, my job is sitting and eating these here donuts.
This is very helpful for a team and even though it takes a little time it actually speeds things up in the long run. Using PR templates can help. A general description of the problem including a screenshot or video go a long way.
I remember when I was working at a startup and a new engineer merged his code and it totally broke the service. I asked him if he ran his code locally first and he stared at me speechless.
Running the code locally is the easiest way to eliminate a whole series of silly bugs.
As mentioned in the article, adding a test and then reverting your change to make sure the test fails is really important, especially with LLMs writing tests. They are great at making things look like they work but completely don’t.
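A small sketch of that ritual with a hypothetical off-by-one fix (all names made up): before trusting the green run, revert the fix, watch the test go red, then re-apply it.

    def paginate(items, page_size):
        # Fixed version: slicing by start index keeps the final partial page.
        # (The hypothetical buggy version built len(items) // page_size full
        # pages and silently dropped the remainder.)
        return [items[i:i + page_size] for i in range(0, len(items), page_size)]

    def test_last_partial_page_is_kept():
        # Against the buggy implementation this assertion fails, and that red
        # run is the evidence that the test actually exercises the fix.
        assert paginate([1, 2, 3, 4, 5], page_size=2) == [[1, 2], [3, 4], [5]]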
Non-native speaker here. I’ve always loved that we say “commit” not “upload” or “save”.
"Your job is to deliver code you have proven to work"
And...code that has been 100% reviewed, even if it was fully LLM generated.
> Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That’s no longer valuable. What’s valuable is contributing code that is proven to work.
That's really not a great development for us. If our main contribution is now reduced to accountability for the result, with barely any involvement in the implementation, that's very little moat and doesn't command a high salary. Either we provide real value or we don't... and from that essay I think it's not totally clear what the value is; it seems like every QA, junior SWE, or even product manager can now do the job of prompting and checking the output.
The solution is easy: responsibility.
The point is to hire people who can own the code and the codebase. “Someone will review it” is a dead end.
It's very similar to the verification engineering problem I wrote about on HN last week. AI is only as good as our ability to prove its work is genuine, and we need humans in the loop to fill the gaps between autonomous systems and ultimately to be held accountable under human laws. It's kind of sad, but it's the reality we are facing.
>> As software engineers we don’t just crank out code—in fact these days you could argue that’s what the LLMs are for. We need to deliver code that works—and we need to include proof that it works as well.
I would go a step further: we need to deliver code that belongs. This means following existing patterns and conventions in the codebase. Without explicit instruction, LLMs are really bad at this, and it's one of the things that makes it incredibly obvious to reviewers that a given piece of code was generated by AI.
"Slow the f*ck down." - Oliver Reichenstein [1]
This only happens because the software industry has fallen into the Religion of Speed. I see it constantly: justified corner-cutting, rushing shit out the door, and always loading up another feature/project/whatever with absolutely zero self-awareness. AI is just an amplifier for bad behavior that was already causing chaos.
What's not being said here but should be: discipline matters. It's part of being a professional, and it always precedes being someone who can ship code that "just works."
[1] https://ia.net/*
> there’s one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers—or open source maintainers—and expects the “code review” process to handle the rest.
It's even worse than that: non-junior devs are doing it as well.