Hacker News

0xbadcafebee today at 6:02 PM

> then you're stuck reading every line because it might've missed some edge case or broken something

This is what tests are for. Humans famously write crap code. They read it and assume they know what's going on, but actually they don't. Then they modify a line of code that looks like it should work, and it breaks 10 things. Tests are there to catch when it breaks so you can go back and fix it.

Agents are supposed to run tests as part of their coding loops, modifying the code until the tests pass. Of course, reward hacking means the AI might modify the test to 'just pass' to get around this. So the tests need to be protected from the AI (in their own repo, behind a commit/merge filter, or whatever you want) and curated by humans. The tests can be created initially by the AI from user stories, but any modifications go through a PR process and get scrutinized. You should have many kinds of tests (unit, integration, end-to-end, regression, etc.), and you can have different levels of scrutiny (maybe the AI can modify unit tests on the fly, and in PRs you only look at the test modifications to ensure they're sane). You can also have a different agent with a different prompt do a pre-review that focuses only on looking for reward hacks.
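To make the "commit/merge filter" idea concrete, here's a minimal sketch of a CI check that blocks changes to protected tests unless a human has signed off. The layout (tests under tests/, unit tests left unprotected) and the 'Approved-test-change:' commit trailer are hypothetical conventions I made up for illustration, not anything from a real tool:

```python
#!/usr/bin/env python3
"""Sketch of a merge filter that keeps agents from rewriting protected tests.
Assumptions: the check runs in CI with two refs (base, head), protected tests
live under the listed prefixes, and human approval is recorded as an
'Approved-test-change:' trailer in a commit message."""

import subprocess
import sys

# Unit tests are deliberately absent: the AI may modify those on the fly,
# while higher-level tests require human review.
PROTECTED_PREFIXES = ("tests/integration/", "tests/e2e/", "tests/regression/")


def changed_files(base: str, head: str) -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def has_human_approval(base: str, head: str) -> bool:
    # Hypothetical convention: a human reviewer adds an
    # 'Approved-test-change:' trailer to one of the commits.
    log = subprocess.run(
        ["git", "log", "--format=%B", f"{base}..{head}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "Approved-test-change:" in log


def main() -> int:
    base, head = sys.argv[1], sys.argv[2]  # e.g. origin/main HEAD
    touched = [f for f in changed_files(base, head)
               if f.startswith(PROTECTED_PREFIXES)]
    if touched and not has_human_approval(base, head):
        print("Protected tests modified without human approval:")
        for f in touched:
            print(f"  {f}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The same check could just as well live in a separate tests repo or a branch protection rule; the point is only that the gate sits outside the agent's reach.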


Replies

CoffeeOnWrite today at 7:01 PM

Tests are not free; over-proliferation of AI-touched tests is itself a problem, similar to the problem of duplicative and verbose AI-generated code.

And tests are inherently imperfect: they may not test at the right layer, so they break when they shouldn't, and they certainly don't capture every premise.

I'm on board with the tactics you suggest, but they are only incrementally helpful. What we really need is AI that removes duplicative code and unnecessary tests.