You can always tell claude to use red-green-refactor and that really is a step-up from "yeah don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.
The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other to be more compliant.
So more stuff happens with this approach but how do you know what it generates is correct?
I’m telling it to use red/green tdd [1] and it will write test that don’t fail and then says “ah the issue is already fixed” and then move on. You really have to watch it very closely. I’m having a huge problem with bad tests in my system despite a “governance model” that I always refer it to which requires red/green tdd.
[1] https://simonwillison.net/guides/agentic-engineering-pattern...
This sounds interesting. Can you go a bit deeper or provide references on how to implement the green/red/refactor subagent pattern?
Good idea, and an improvement, but you still have that fundamental issue: you don't really know what code has been written. You don't know the refactors are right, in alignment with existing patterns etc.
How exactly do you set up your CC sessions to do this?
thats a great idea - i have been using codex to do my code reviews since i have it to give better critique on code written by claude but havent tried it with testing yet!
> But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.
It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start grow in count and important new things aren't tested or aren't tested well.
I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests within that just asserted the test harness had been set up correctly and the functionality that had been in those tests get tested that it exists but its behavior now virtually untested.
Reward hacking is very real and hard to guard against.