Who or what will review the 5 PRs (including their updates to automated tests)? If it's just yet another agent, do we need 5 of these reviews for each PR too?
In the end, you either concede control over 'details' and just trust the output or you spend the effort and validate results manually. Not saying either is bad.
If you can define your problem well then you can write tests up front. An ML person would call tests a "verifier". Verifiers let you pump compute into finding solutions.