
dregitsky, today at 5:48 PM

To add to what @nab said, the longest ("overnight") runs usually come after going back and forth to build out a big multi-phase plan doc -- especially when each phase has an extensive manual test plan (agent runs the app in a browser, clicks through the workflow, watches logs, confirms behavior, etc.).

These can go for many hours because of all the manual testing and debugging. Quality really depends on how much you spec things out beforehand, and how you define the test plan / "success" gates. If the agent can't even run the app to test it, then things can definitely go off the rails!
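
For anyone wondering what a "success" gate per phase might look like, here is a rough sketch. It is not my actual setup, just an illustration: the `Phase` class, `run_plan`, and the check functions are all hypothetical names, and the checks are stubs standing in for the real "run the app, click through, watch logs" steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Phase:
    """One phase of the plan doc, with its success gates (hypothetical structure)."""
    name: str
    # Each gate is a zero-argument check that returns True on success.
    gates: List[Callable[[], bool]] = field(default_factory=list)


def run_plan(phases: List[Phase]) -> Optional[str]:
    """Run phases in order; return the first phase whose gate fails, else None."""
    for phase in phases:
        for gate in phase.gates:
            if not gate():
                # Stop here rather than building later phases on a broken one.
                return phase.name
    return None


# Stub checks (assumptions for illustration only).
def app_starts() -> bool:
    return True


def login_flow_works() -> bool:
    return False


plan = [
    Phase("phase 1: scaffold", [app_starts]),
    Phase("phase 2: login", [app_starts, login_flow_works]),
    Phase("phase 3: billing", [app_starts]),
]

failed_at = run_plan(plan)
print(failed_at)
```

The point is just that each phase has explicit pass/fail checks and the run stops at the first failure, so later phases don't pile onto something that never worked.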