> The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.
> What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.
Your first two paragraphs are at odds with each other. If it fails, you've only wasted the time it took the agent to *perform* the task that "takes humans 4h", which in most cases is single-digit minutes.
That's why one of the solid use cases for agents is running multiple throwaway proofs of concept to explore a problem or a new feature before deciding on a solution to actually implement. Usually you'd have time for one, or maybe none. If an attempt fails you've lost maybe 10 minutes, but likely learned something new about the potential solution.
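To make the economics concrete, here's a rough back-of-the-envelope sketch. All the numbers are assumptions for illustration (50% success rate, ~10 minutes of agent wall-clock time per attempt, ~15 minutes of human time to review each attempt, 4 human-hours to do the task yourself), not figures from the benchmark:

```python
# Illustrative comparison of expected human time spent.
# All constants below are assumed, not taken from the benchmark.

HUMAN_HOURS = 4.0      # time for a human to do the task themselves
AGENT_MINUTES = 10.0   # wall-clock time for one agent attempt (runs unattended)
REVIEW_MINUTES = 15.0  # assumed human time to check/triage each attempt
SUCCESS_RATE = 0.5     # the benchmark's 50% reliability level

# Expected number of attempts until one succeeds (geometric distribution).
expected_attempts = 1 / SUCCESS_RATE

# When delegating, the human mostly pays review time per attempt,
# since the agent's own minutes run in the background.
expected_human_minutes_delegating = expected_attempts * REVIEW_MINUTES

print(f"Doing it yourself: {HUMAN_HOURS * 60:.0f} minutes")
print(f"Delegating (expected): {expected_human_minutes_delegating:.0f} minutes "
      f"of your time over ~{expected_attempts:.0f} attempts")
```

Under those assumed numbers a failed run costs a fraction of doing the task by hand, which is why "gambling" on a 50% success rate can still be a clear win.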