try harnesses other than Codex.
I've had more success with review tools than with expecting the agent to get the code quality right the first time.
current workflow:
1. specs/requirements/design, outputting tasks
2. implementation, outputting code and tests
3. run review scripts/debug loops, outputting tasks
4. implement tasks
5. go back to 3
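a rough sketch of that loop in Python, in case it helps to see it written down. `implement` and `review` are hypothetical callables standing in for "hand tasks to the agent" and "run the review scripts"; neither is a real API, and `max_rounds` is just an assumed safety cap so the loop can't spin forever.

```python
from typing import Callable, List

Task = str  # assumption: a task is just a short text description


def review_loop(
    implement: Callable[[List[Task]], None],
    review: Callable[[], List[Task]],
    initial_tasks: List[Task],
    max_rounds: int = 5,
) -> int:
    """Alternate implement and review until review finds nothing; return rounds used."""
    tasks = list(initial_tasks)
    rounds = 0
    while tasks and rounds < max_rounds:
        implement(tasks)   # steps 2 and 4: the agent writes code and tests
        tasks = review()   # step 3: review scripts emit follow-up tasks
        rounds += 1        # step 5: loop back to review output
    return rounds
```

the loop stops either when a review pass comes back empty or when the round cap is hit, so a noisy review script can't keep the agent busy indefinitely.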
the quality of the specs, tasks, and review scripts makes a big difference.
one of the biggest improvements comes from giving the agent a feedback loop from what the app actually does: good logs, plus the ability to interact with the app and take screenshots, à la Playwright.
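the simplest version of this is log-based: run the app (or a smoke test) and hand the agent the exit code plus the tail of the output. a minimal sketch using only the standard library; `run_and_capture` and its parameters are made-up names for illustration, and a screenshot-based loop with something like Playwright would follow the same shape but isn't shown here.

```python
import subprocess
import sys
from typing import List


def run_and_capture(cmd: List[str], tail_lines: int = 20, timeout: float = 60.0) -> str:
    """Run a command and return a short text report an agent can read."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # Keep only the tail of stdout followed by stderr, to keep the report short.
    lines = (proc.stdout + proc.stderr).splitlines()
    tail = "\n".join(lines[-tail_lines:])
    return f"exit code: {proc.returncode}\n--- last {tail_lines} lines ---\n{tail}"


# Example call: run a tiny Python snippet as a stand-in for the real app.
report = run_and_capture([sys.executable, "-c", "print('server started')"])
```

the point is that the agent sees what actually happened at runtime instead of guessing from the source.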
guidelines and guardrails work best when they're tools that the agent runs, or that run automatically to give feedback.
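as an illustration, here is a guideline turned into a small check the agent can run. the two rules in `RULES` are hypothetical examples, not rules from any particular project, and a real setup would more likely lean on an existing linter; this is only a sketch of the idea that each finding comes back as a file, a line number, and a concrete instruction.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Hypothetical project rules: (regex, message shown to the agent).
RULES: List[Tuple[str, str]] = [
    (r"\bprint\(", "use the logger instead of print()"),
    (r"except\s*:", "catch a specific exception, not a bare except"),
]


def check_file(path: Path) -> List[str]:
    """Return one 'path:line: message' finding per rule match in the file."""
    findings = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for pattern, message in RULES:
            if re.search(pattern, line):
                findings.append(f"{path}:{lineno}: {message}")
    return findings


def check_tree(root: Path) -> List[str]:
    """Check every .py file under root, in a stable order."""
    findings = []
    for path in sorted(root.rglob("*.py")):
        findings.extend(check_file(path))
    return findings
```

wired into a pre-commit hook or a review script, the output of `check_tree` becomes the next batch of tasks rather than a paragraph in a prompt that the agent may or may not follow.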