I think the main value lies in letting the agent try many things while you aren't working (while you sleep or do other things). Even if most of the tests turn out to be useless, across enough trials it can find something nice with no effort on your part.
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
Experiments for us cost on the order of tens of dollars each, so running 100 of them every night quickly adds up to the price of an entire new employee. And that’s not even including the cost of letting agents run all night.
Definitely not in the budget for non-VC-backed companies that aren’t in the AI bubble.
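
To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The $20 per experiment and the 30 nights per month are assumptions picked only to stand in for "tens of dollars" and "every night"; they are not figures from the comment, and agent or inference costs are left out entirely.

```python
def nightly_cost(cost_per_experiment: int, experiments_per_night: int) -> int:
    """Total spend on experiments for one night, in dollars."""
    return cost_per_experiment * experiments_per_night


def monthly_cost(cost_per_experiment: int, experiments_per_night: int,
                 nights_per_month: int = 30) -> int:
    """Total spend on experiments over a month of nightly runs, in dollars."""
    return nightly_cost(cost_per_experiment, experiments_per_night) * nights_per_month


# Assumed inputs: $20 per experiment (a stand-in for "tens of dollars"),
# 100 experiments per night, 30 nights per month.
per_night = nightly_cost(20, 100)
per_month = monthly_cost(20, 100)
print(f"${per_night:,} per night, ${per_month:,} per month")
```

Plug in your own per-experiment cost; the point is only that the nightly total scales linearly with the number of trials, so it gets large fast.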
> single test can take half a day
Why is that?
I don't doubt you, but when Shigeo Shingo created SMED (Single-Minute Exchange of Die), die changes were an hours-long process, and the goal was to bring them under ten minutes.
Even if your tests take a long time, you can always (if hardware permits) run multiple tests in parallel. This would enable you to explore many approaches at the same time.
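
As a minimal sketch of the parallel idea, assuming each test can be wrapped in a plain function call: `run_test` and its sleep are hypothetical stand-ins for whatever a real test does, and `max_workers` should be whatever your hardware can actually handle. Threads are a reasonable fit when each test mostly waits on an external process or device; for CPU-bound Python work, `ProcessPoolExecutor` from the same module would be the usual swap.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_test(name: str, duration_s: float) -> tuple[str, bool]:
    """Hypothetical stand-in for one slow test; sleeps instead of doing real work."""
    time.sleep(duration_s)
    return name, True


def run_in_parallel(tests: dict[str, float], max_workers: int) -> dict[str, bool]:
    """Run all tests concurrently, at most max_workers at a time."""
    results: dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_test, name, d) for name, d in tests.items()]
        # Collect results as each test finishes, not in submission order.
        for future in as_completed(futures):
            name, ok = future.result()
            results[name] = ok
    return results


if __name__ == "__main__":
    # Three placeholder approaches, each "taking" a fraction of a second here.
    approaches = {"approach_a": 0.2, "approach_b": 0.2, "approach_c": 0.2}
    print(run_in_parallel(approaches, max_workers=3))
```

With three workers, wall-clock time is roughly that of the slowest single test rather than the sum of all three, which is the whole appeal when one test takes half a day.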