There's probably a human manager going "Great! How come I can't get my engineering team to ship this much QUALITY?"
Interesting experiment. Looking at this, I immediately thought of a similar experiment run by Google: AlphaEvolve. Throwing LLM compute at problems might work if the problem is well defined and the result can be objectively measured.
As for this experiment: what does quality even mean? Most human devs will have different opinions on it. If you asked 200 different devs (Claude starts from zero after each iteration) to do the same, I doubt the code would look much better.
I am also wondering what would happen if Claude had the option to just walk away from the code once it's "good enough". For each problem, most human devs run a cost->benefit equation in their head, and only the worthwhile ideas get realized. Claude doesn't do that: the cost of writing code is very low on its side, and the prompt doesn't allow any graceful exit :)
For all the bad code, havoc was most certainly not 'wrecked'; it may have been 'wreaked', though . . .
Don't use cloc in 2025. Use tokei or whatever.
This strikes me as a very solid methodology for improving the results of all AI coding tools. I hope Anthropic, etc., take this up.
Rather than converging on optimal code (Occam's Razor for both maintainability and performance) they are just spewing code all over the scene. I've noticed that myself, of course, but this technique helps to magnify and highlight the problem areas.
It makes you wonder how much training material was/is available for code optimization relative to training material for just coding to meet functional requirements. And therefore, what's the relative weight of optimizing code baked into the LLMs.
Am I the only one who is surprised that the app still works?!
Well, given it can't say "no, I think it's good enough now", you'll just get madness, no?
I would love to see an experiment done like this with an arena of principal engineer agents. Give each of them a unique personality: this one likes shiny new objects and is willing to deal with early adopter pain, this one is a neckbeard who uses emacs as pid 1 and sends email via usb thumbdrive, and the third is a pragmatic middle-of-the-road person who can help be the glue between them. All decisions need to reach a quorum before continuing. Better yet: each agent is running on a completely different model from a different provider. 3 can be a knob you dial up to 5, 10, etc. Each of these agents can spawn sub-agents to reach out to specialists like a CSS expert or a DBA.
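A minimal sketch of what the quorum step might look like, assuming a hypothetical ReviewerAgent interface (the names and shape are made up, not any real framework's API):

```ts
// Each reviewer agent (ideally backed by a different model/provider) votes on a
// proposed diff; the change only lands once a simple majority approves.
type Verdict = "approve" | "reject";

interface ReviewerAgent {
  name: string; // e.g. "early adopter", "emacs neckbeard", "pragmatist"
  review(diff: string): Promise<Verdict>;
}

async function reachQuorum(agents: ReviewerAgent[], diff: string): Promise<boolean> {
  const verdicts = await Promise.all(agents.map((agent) => agent.review(diff)));
  const approvals = verdicts.filter((v) => v === "approve").length;
  return approvals > agents.length / 2; // dial the agent count from 3 up to 5, 10, ...
}
```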
I think prompt engineering could help here a bit: add some context on what a quality codebase is, remove everything that is not necessary, consider future maintainability (20k -> 84k lines is a smell). All of these are smells that a simple supervisor agent could have caught.
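A supervisor check on raw line growth would only be a few lines; here is a hypothetical sketch (the 20% threshold and the git/wc-based count are my assumptions, not anything from the article):

```ts
// Hypothetical supervisor check: count tracked .ts lines before and after an
// iteration and complain if the codebase grew more than the allowed ratio.
import { execSync } from "node:child_process";

const MAX_GROWTH = 1.2; // assumed limit: 20% growth per iteration

function countTsLines(dir: string): number {
  // Crude count via git + wc; cloc or tokei would work just as well.
  const out = execSync(`git -C ${dir} ls-files '*.ts' | xargs wc -l | tail -n 1`, {
    encoding: "utf8",
  });
  return parseInt(out.trim().split(/\s+/)[0], 10);
}

function superviseIteration(linesBefore: number, linesAfter: number): string | null {
  if (linesAfter > linesBefore * MAX_GROWTH) {
    return `Codebase grew from ${linesBefore} to ${linesAfter} lines; simplify before adding more.`;
  }
  return null; // no complaint, let the next iteration proceed
}
```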
The viewport of this website is quite infuriating. I have to scroll horizontally to see the `cloc` output, but there's 3x the empty space on either side.
So now you know. You can get claude to write you a ton of unit tests and also improve your static typing situation. Now you can restrict your prompt!
20K --> 84K lines of ts for a simple app is bananas. Much madness indeed! But also super interesting, thanks for sharing the experiment.
This really mirrors my experience trying to get LLMs to clean up kernel driver code; they seem utterly incapable of simplifying things.
That's my experience with AI: most of the time it creates an overengineered solution unless told to keep it simple.
Just the headline sounds like a YouTube brain rot video title:
"I spent 200 days in the woods"
"I Google translated this 200 times"
"I hit myself with this golf club 200 times"
Is this really what hacker news is for now?
When I ask coding agents to add tests, they often come up with something like this:
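(A made-up Jest example of the pattern, with hypothetical names: the test mocks the very module it claims to cover, so it only ever asserts the mock and can never fail.)

```ts
// "userService" and getUser are invented for illustration; the point is the shape.
import { getUser } from "./userService";

jest.mock("./userService", () => ({
  getUser: jest.fn().mockResolvedValue({ id: 1, name: "Alice" }),
}));

test("getUser returns a user", async () => {
  const user = await getUser(1);
  // This asserts the mocked return value, not any real behavior.
  expect(user).toEqual({ id: 1, name: "Alice" });
});
```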
So I am not at all surprised about Claude adding 5x the tests, most of which are useless. It's going to be fun to look back at this and see how much slop these coding agents created.