I'm using what I call "hermetic agents", where completely sandboxed agents write code and tests from the same specification, where the code writer can't see the test and the test writer can't see the code. The idea is that we can get better quality this way (by avoiding confirmation bias between code and tests). It is more painful to set up however, since you have to distill a spec and guides that the agent would normally hook into using RAG.
It is more like people (agent?) management than coding though. I'm setting up and debugging processes, rather than writing code. I spend a lot of time cursing at and arguing with the agents I'm using to set up hermetic agents (who I can't argue with obviously, but I can have conventional agents go over their logs to figure out how to improve their sandboxed-context).
I did a thing sorta like this between app and infra. My app agent sends messages to the infra agent for what it needs and they go back and forth to sort things out. They invented their own working process on top of it and wrote their own tool. It's interesting watching all this work. Feels like the future more than a lot of things I've seen in the past few years. Shame we're doing this to the detriment of the environment and the economy.
"""
Claude Code mesh: gossip-based multi-instance coordination.
Usage:
mesh.py register --id ID --repo PATH --keyword KEYWORD
mesh.py list
mesh.py send --from ID --to ID MESSAGE
mesh.py broadcast --from ID MESSAGE
mesh.py watch --id ID
mesh.py forward --id ID MESSAGE_JSON
mesh.py peers --id ID [--n N]
mesh.py clean
"""do you mind sharing the more specific setup or agent framework? Hermes? LangChain? DIY?
Literally converged to this same pattern over this month.
Doing similar stuff. I curse a lot. The processes do transfer mostly between different models which a bit validates the approach.
This is similar to how our college CS problem sets were graded. We were given a spec, and we had to implement a program that conformed to it. We had access to 70% of the test suite during development, and another 30% was hidden and only evaluated after submission. We were graded out of 100.
It was effective at making you think about the problem and anticipate what tests might be missing. I can see how this would be effective for coding agents, which tend to get progressively lazier at writing tests as session context grows.