Hacker News

simonw · yesterday at 9:35 PM

I'm convinced now that the key to getting useful results out of coding agents (Claude Code, Codex CLI etc) is having good mechanisms in place to help those agents exercise and validate the code they are writing.

At the most basic level this means making sure they can run commands to execute the code. This is easiest with languages like Python; with HTML+JavaScript you need to remind them that Playwright exists and they should use it.
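For example, a browser check as small as this sketch is something an agent can run and iterate on by itself (the URL, selector, and expected text here are hypothetical, using Playwright's Python API):

```python
# Minimal Playwright check an agent can run after each change.
# The URL, selector, and expected text are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:8000/index.html")
    page.click("#submit")
    assert page.inner_text("#result") == "ok"
    browser.close()
```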

The next step up from that is a good automated test suite.

Then we get into code-quality and quality-of-life tools: automatic code formatters, linters, fuzzing tools, etc.

Debuggers are good too. They tend to be less coding-agent friendly because they often have directly interactive interfaces, but agents can increasingly use them - and there are other options that are a better fit as well.

I'd put formal verification tools like the ones mentioned by Martin on this spectrum too. They're potentially a fantastic unlock for agents - they're effectively just niche programming languages, and models are really good at even niche languages these days.

If you're not finding any value in coding agents but you've also not invested in execution and automated-testing infrastructure, that's probably why.


Replies

roadside_picnic · yesterday at 10:49 PM

I very much agree, and I believe using languages with powerful type systems could be a big step in this direction. Most people's first experience with Haskell is "wow, this is hard to write a program in, but when I do get it to compile, it works". If this works for human developers, it should also work for LLMs (especially if the human doesn't have to worry about the 'hard to write a program' part).

> The next step up from that is a good automated test suite.

And if we're going for a powerful type system, then we can really leverage the power of property tests, which are currently grossly underused. Property tests are a perfect match for LLMs because they allow the human to create a small number of tests that cover a very wide surface of possible errors.
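As a minimal sketch of the idea, here it is in Python's hypothesis library (QuickCheck in Haskell works the same way; the run-length encoder is just a stand-in):

```python
# One property, arbitrarily many generated cases: round-tripping a
# run-length encoder. Run with pytest; pip install hypothesis.
from hypothesis import given, strategies as st

def encode(xs):
    # [1, 1, 2] -> [(1, 2), (2, 1)]
    out = []
    for x in xs:
        if out and out[-1][0] == x:
            out[-1] = (x, out[-1][1] + 1)
        else:
            out.append((x, 1))
    return out

def decode(pairs):
    return [x for x, n in pairs for _ in range(n)]

@given(st.lists(st.integers()))
def test_roundtrip(xs):
    # Decoding an encoding must reproduce the input exactly.
    assert decode(encode(xs)) == xs
```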

The "thinking in types" approach to software development in Haskell allows the human user to keep at a level of abstraction that still allows them to reason about critical parts of the program while not having to worry about the more tedious implementation parts.

Given how much interest there has been in using LLMs to improve Lean code for formal proofs in the math community, maybe there's a world where we make use of an even more powerful type system than Haskell's. If LLMs with the right language can help prove complex mathematical theorems, then it should certainly be possible to write better software with them.

ManuelKiessling · yesterday at 10:14 PM

Isn't it funny how that's exactly the kind of stuff that helps a human developer be successful and productive, too?

Or, to put it the other way round, what kind of tech leads would we be if we told our junior engineers "Well, here's the codebase, that's all I'll give you. No debuggers, linters, or test runners for you. Using a browser on your frontend implementation? Nice try, buddy! Now good luck getting those requirements implemented!"

mccoyb · yesterday at 9:54 PM

Claude Code was a big jump for me. Another large-ish jump was multi-agents and following the tips from Anthropic’s long-running harnesses post.

I don’t go into Claude without everything already set up. Codex helps me curate the plan and the issue tracker (one instance). Claude gets a command to fire up into context and grab an issue, implements it, and then Codex and Gemini review independently.

I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.

This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.

I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.

show 1 reply
rmah · today at 12:50 AM

I think the main limitation is not code validation but assumption verification. When you ask an LLM to write some code based on a few descriptive lines of text, it is, by necessity, making a ton of assumptions. Oddly, none of the LLMs I've seen ask for clarification when multiple assumptions might all be likely. Moreover, from the behavior I've seen, they don't really backtrack to select a new assumption based on further input (I might be wrong here, it's just a feeling).

What you don't specify, it must assume. And therein lies a huge landscape of possibilities. And since the AIs can't read your mind (yet), their assumptions will probably not precisely match yours unless the task is very limited in scope.

apitman · today at 8:14 PM

That's bad news for C++, Rust, and other languages with slow compilers.

Davidzheng · yesterday at 11:56 PM

OK, but if the verification loop really makes the agents MUCH more useful, then this usefulness difference can be used as a training signal to improve the agents themselves. So the current capability levels are certainly not going to remain for very long (which is what I expect anyway, but I'd like to point out that it's also supported by this).

oxag3n · yesterday at 10:35 PM

Where would they get training data?

Source code generation is possible thanks to a large training set and the effort put into reinforcing better outcomes.

I suspect debugging is not that straightforward to LLM'ize.

Debugging is a non-sequential interaction: when something happens, it hasn't necessarily caused the problem, and the timeline may be shuffled. An LLM would need tons of examples where something happens in a debugger or in logs, so that it can associate it with another abstraction.

I was debugging something in gdb recently and it was a pretty challenging bug. Out of interest I tried ChatGPT, and it was hopeless: try this, add this print, etc. That's not how you debug multi-threaded and async code. When I found the root cause, I analyzed how I did it and where I had learned that specific combination of techniques, each individually well documented but never in combination. It came from learning from other people and from my own experience.

CobrastanJorji · yesterday at 10:32 PM

I might go further and suggest that the key to getting useful results out of HUMAN coding agents is also to have good mechanisms in place to help them exercise and validate the code.

We valued automated tests and linters and fuzzers and documentation before AI, and that's because they serve the same purpose.

QuercusMax · yesterday at 9:37 PM

I've only done a tiny bit of agent-assisted coding, but without the ability to run tests the AI will really go off the rails super quick, and it's kinda hilarious to watch it say "Aha! I know what the problem is!" over and over as it tries different flavors until it gives up.

pron · yesterday at 11:18 PM

Maybe in the short term, but that doesn't solve some fundamental problems. Consider NP problems: problems whose solutions can be easily verified. But the fact that they can all be easily verified does not (as far as we know) mean they can all be easily solved. If we look at the P subset of NP, the problems that can be easily solved, then easy verification is no longer their key feature. Rather, it is something else that distinguishes them from the harder problems in NP.

So let's say that, similarly, there are programming tasks that are easier and harder for agents to do well. If we know that a task is in the easy category, of course having tests is good, but since we already know that an agent does it well, the test isn't the crucial aspect. On the other hand, for a hard task, all the testing in the world may not be enough for the agent to succeed.

Longer term, I think it's more important to understand what's hard and what's easy for agents.

rodphil · today at 1:49 AM

> At the most basic level this means making sure they can run commands to execute the code. This is easiest with languages like Python; with HTML+JavaScript you need to remind them that Playwright exists and they should use it.

So I've been exploring the idea of going all-in on this "basic level" of validation. I'm assembling systems out of really small "services" (written in Go) that Claude Code can immediately run and interact with using curl, jq, etc. Plus, when building a particular service, I already have all of the downstream services (the dependencies) built and running, so a lot of dependency management and integration challenges disappear. I'm only trying this out at a small scale so far, but it's fascinating how LLMs can potentially invert a lot of the economics that inform the current conventional wisdom.

(Shameless plug: I write about this here: https://twilightworld.ai/thoughts/atomic-programming/)

My intuition is that LLMs will for many use cases lead us away from things like formal verification and even comprehensive test suites. The cost of those activities is justified by the larger cost of fixing things in production; I suspect that we will eventually start using LLMs to drive down the cost of production fixes, to the point where a lot of those upstream investments stop making sense.

simianwords · today at 3:35 AM

Claude Code and other AI coding tools must have a *mandatory* hook for verification.

For the front end, the verification is making sure that the UI looks as expected (can be verified by an image model) and that clicking certain buttons results in certain things (can be verified by a ChatGPT agent, but that's not public, I guess).

For the back end, it will involve firing API requests one by one and verifying the results.
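Something like this sketch is the whole shape of it (the endpoints and payloads are hypothetical, using Python's requests):

```python
# Hedged sketch of backend verification: fire requests one by one and
# assert on the results. Endpoints and payloads are hypothetical.
import requests

BASE = "http://localhost:8080"

# Create a resource and check the response code.
r = requests.post(f"{BASE}/users", json={"name": "alice"})
assert r.status_code == 201, r.text
user_id = r.json()["id"]

# Read it back and check the round trip.
r = requests.get(f"{BASE}/users/{user_id}")
assert r.status_code == 200
assert r.json()["name"] == "alice"
print("backend checks passed")
```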

To make this easier, we need to somehow give Claude or whatever agent an environment to run these verifications in, and this is the gap that is missing. Claude Code and Codex should now start shipping verification environments that make it easy to verify frontend and backend tasks; I think Antigravity already helps a bit here.

------

The thing about backend verification is that it differs between companies and requires a custom implementation that can't easily be shared across them. Each company has its own way to deploy stuff.

Imagine a concrete task like creating a new service that reads from a data stream, runs transformations, puts it in another data stream where another new service consumes the transformed data and puts it into an AWS database like Aurora.

```
stream -> service (transforms) -> stream -> service -> Aurora
```

To one-shot this with Claude Code, it must know everything about the company:

- how does one consume streams in the company? Schema registry?

- how does one create a new service and register dependencies? how does one deploy it to test environment and production?

- how does one even create an Aurora DB? request approvals and IAM roles etc?

My question is: what would it take for Claude Code to one-shot this? At the code level it is not too hard, and it can fit in the context window easily, but the *main* problem is the fragmented processes around creating the infra and operating it, which are human-based now (and need not be!).
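To illustrate how small the "easy" code half is, here's a hedged sketch of the first transform service, assuming Kinesis-style streams via boto3 (the stream names, shard handling, and transform are all hypothetical placeholders):

```python
# The "easy" half: poll one stream, transform, write to the next.
# Stream names and the transform are hypothetical; IAM, deploys, and
# schema registries (the hard half) appear nowhere in this file.
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def transform(record: dict) -> dict:
    record["processed"] = True  # stand-in for the real transformation
    return record

# Single-shard sketch; a real service would enumerate shards.
shard_it = kinesis.get_shard_iterator(
    StreamName="input-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=shard_it, Limit=100)
    for rec in resp["Records"]:
        out = transform(json.loads(rec["Data"]))
        kinesis.put_record(
            StreamName="output-stream",
            Data=json.dumps(out).encode(),
            PartitionKey=str(out.get("id", "0")),
        )
    shard_it = resp["NextShardIterator"]
    time.sleep(1)  # avoid hammering the stream API
```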

-----

My prediction is that companies will make something like a new "agent" environment where all these processes (that used to require a human) can be done by an agent without human intervention.

I'm thinking of other solutions here, but if anyone can figure it out, please tell!

jijijijij · yesterday at 10:21 PM

> At the most basic level this means making sure they can run commands to execute the code

Yeah, it's gonna be fun waiting for compilation cycles when those models "reason" with themselves about a semicolon. I guess we just need more compute...

thomasfromcdnjs · today at 12:29 AM

Shameless plug: I'm working on an open source project, https://blocksai.dev/, that attempts to solve this (and I just added a note to myself to add formal verification).

Elevator pitch: "Blocks is a semantic linter for human-AI collaboration. Define your domain in YAML, let anyone (humans or AI) write code freely, then validate for drift. Update the code or update the spec, up to human or agent."

(you can add traditional linters to the process if you want, but it's not necessary)

The gist is that you define a bunch of validators for a collection of modules you're building (with agentic coding), with a focus on qualifying semantic things:

- domain / business rules/measures

- branding

- data flow invariants — "user data never touches analytics without anonymization"

- accessibility

- anything you can think of

Then you just tell your agentic coder to use the CLI tool before committing, so it keeps the code in line with your engineering/business/philosophical values.

A (boring) example of it detecting whether blog posts have humour in them, running in Claude Code: https://imgur.com/diKDZ8W

zahlman · yesterday at 10:41 PM

One objection: all the "don't use --yolo" training in the world is useless if a sufficiently context-poisoned LLM starts putting malware in the codebase and getting to run it under the guise of "unit tests".

WhyOhWhyQ · yesterday at 10:32 PM

I've tried getting Claude to set up testing frameworks, but what ends up happening is that it either creates canned tests, forgets about tests, or outright lies about making tests. It's definitely helpful, but it feels very far from a robust thing to rely on. If you're reviewing everything the AI does, then it will probably work, though.

ramoz · yesterday at 10:43 PM

Claude Code hooks are a great way to integrate these things.
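For instance, a PostToolUse hook can run a linter or formatter after every file edit. A minimal sketch of a .claude/settings.json (check Anthropic's hooks docs for the current schema; the ruff command is just an example):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "ruff check --fix ." }
        ]
      }
    ]
  }
}
```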

htrp · yesterday at 10:08 PM

The better question is which tools, and at what level.

agumonkey · yesterday at 9:37 PM

Gemini and Claude do that already, IIUC: self-debugging iterations.

dionian · yesterday at 10:39 PM

You've written some great articles on this topic, and my experience aligns with your view completely.

formerly_proven · yesterday at 9:54 PM

I'm not so sure about formal verification, though. IME with Rust, LLM agents tend to struggle with semi-complex ownership or trait issues and will typically reach for unnecessary/dangerous escape hatches ("unsafe impl Send for ..." instead of using the correct locks, for example) fairly quickly. Or they just conclude the task is impossible.

> automatic code formatters

I haven't tried this, because I assumed it would destroy agent productivity and massively increase the number of tokens needed: you're changing the file out from under the LLM, and it ends up constantly re-reading the changed bits to generate the correct str_replace JSON. Or are they smart enough that this quickly trains them to generate code with zero diff under autoformatting?

But in general of course anything that's helpful for human developers to be more productive will also help LLMs be more productive. For largely identical reasons.
