What I want is a harness that knows how to optimize this kind of thing for me.
You might want to check out Amp: https://ampcode.com/
Which is your own harness and your own evals for your tasks, I guess.