Have you considered doing property tests with Mathematica as an oracle?
An ai based development workflow with a concrete oracle works very well. You still need the research and planing to solve things in a scalable way, but it solves the "are the tests correct" issue.
What we've done is pull out failing property tests as a unit tests, makes regression testing during the agentic coding loop much more efficient.