I actually have an "LLM as a judge" loop on all my codebases. An architecture panel debates improvements given an optimization metric and convergence criteria, and I feed its findings into a deterministic spec generator (CUE with validation) that can emit unit/e2e tests and scaffold Terraform. It's pretty magical.
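Roughly, the panel loop has this shape (a minimal sketch, not my actual code; the judge objects and their propose/score/refine methods are stand-ins for whatever models you wire in):

    def run_panel(codebase, judges, metric, epsilon=0.01, max_rounds=10):
        # each judge proposes an improvement for the codebase
        proposals = [j.propose(codebase, metric) for j in judges]
        best, best_score = None, float("-inf")
        for _ in range(max_rounds):
            # every judge scores every proposal against the optimization metric
            scores = {p: sum(j.score(p, metric) for j in judges) / len(judges)
                      for p in proposals}
            top, top_score = max(scores.items(), key=lambda kv: kv[1])
            # convergence criterion: stop once a round stops beating the previous best by epsilon
            if top_score - best_score < epsilon:
                break
            best, best_score = top, top_score
            # judges critique the leading proposal and refine it for the next round
            proposals = [j.refine(top, metric) for j in judges]
        return best  # findings handed off to the spec generator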
This CUE spec gets decomposed into individual tasks by an orchestrator that does research per ticket and bundles it with the task.
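The orchestrator part is nothing fancy, roughly this (again illustrative; the Ticket type and research_fn are placeholders for however you pull in docs, prior art, and relevant code paths):

    from dataclasses import dataclass, field

    @dataclass
    class Ticket:
        id: str
        description: str
        research: list[str] = field(default_factory=list)

    def decompose(spec: dict) -> list[Ticket]:
        # one ticket per task entry in the validated spec
        return [Ticket(id=f"T{i}", description=task)
                for i, task in enumerate(spec.get("tasks", []))]

    def orchestrate(spec: dict, research_fn) -> list[Ticket]:
        tickets = decompose(spec)
        for t in tickets:
            # attach the per-ticket research bundle before handing it to an agent
            t.research = research_fn(t.description)
        return tickets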
I think we are all building the same thing. If only there were an open source framework for aggregating all our work.