I see this as the gap between a general-purpose agent and a coding agent. A coding agent can imagine something to be true, test it, discover that it's wrong, and recover.
But if you go beyond what can be tested easily, asking the agent to do real work rather than writing a patch, imagining things to be true is a problem.
This to me is the big leap from being good at coding to being good at many other tasks.
Coding can be treated as a low-stakes, closed-loop system (a retry costs little time or money), whereas most other tasks cannot.
If the agent screws up booking your flight or hotel room, how does it verify that? And even if it can verify it, there is an actual cost to changes and cancellations.
It's similar with agentic e-commerce: there are lots of ways to screw that up, and it just seems ripe for fraud and for being picked off by bad actors.