This to me is the big leap from being good at coding to being good at many other tasks.
Coding can be treated as a low-stakes, closed-loop system (a retry only costs some time and money), whereas most other tasks cannot.
If an agent screws up booking your flight or hotel room, how does it verify this? And even if it does verify it, there is an actual cost to changes and cancellations.
Similarly with agentic e-commerce: there are lots of ways to screw that up, and it just seems ripe for fraud and for being picked off by bad actors.
To reply to myself here:
I can STILL replicate this behavior in Google AI summaries 10% of the time:
"is <SOMEPLANT> ok for cats"
to which it replies: "Yes, <SOMEPLANT LONG SCIENTIFIC NAME VERBOSE PHRASING> is toxic for cats"
The other one going around this weekend: "how long hot dogs on grill"
Summary: "The hot dogs on your grill are likely around 5-6 inches long .. "
So scale this category of error to unsupervised agents with access to your credit card.
It seems that to make agents safe, we need tentative, reversible transactions. How do you set up a travel plan and then review it? How do you modify it later?
Unfortunately, travel keeps getting less flexible, with worse cancellation policies.
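
For what it's worth, here is roughly what I mean by tentative, reversible transactions, as a minimal Python sketch. Everything in it (`TravelPlan`, `Hold`, `propose`, `review`, `release`, `confirm_all`) is a made-up name for illustration, not any real booking API, and it assumes the vendor would let you hold a booking without charging, which is exactly the part travel is moving away from.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    HELD = "held"            # tentative: nothing charged yet
    CONFIRMED = "confirmed"  # approved by a human, charge made
    RELEASED = "released"    # dropped before confirmation, no cost


@dataclass
class Hold:
    # Hypothetical record for one tentative booking.
    description: str
    price: float
    state: State = State.HELD


@dataclass
class TravelPlan:
    # Hypothetical container the agent fills in and a human reviews.
    holds: List[Hold] = field(default_factory=list)
    charged: float = 0.0

    def propose(self, description: str, price: float) -> Hold:
        """Agent step: place a tentative hold; no money moves."""
        hold = Hold(description, price)
        self.holds.append(hold)
        return hold

    def review(self) -> List[str]:
        """Human step: list everything still waiting for approval."""
        return [
            f"{h.description}: {h.price:.2f}"
            for h in self.holds
            if h.state is State.HELD
        ]

    def release(self, hold: Hold) -> None:
        """Undo a tentative hold for free; confirmed holds are final here."""
        if hold.state is not State.HELD:
            raise ValueError("only tentative holds can be released")
        hold.state = State.RELEASED

    def confirm_all(self) -> float:
        """Approved step: charge every hold that is still pending."""
        for h in self.holds:
            if h.state is State.HELD:
                h.state = State.CONFIRMED
                self.charged += h.price
        return self.charged


plan = TravelPlan()
flight = plan.propose("flight", 420.0)
hotel = plan.propose("hotel, 3 nights", 510.0)
pending = plan.review()   # what the human sees before any charge
plan.release(hotel)       # reviewer rejects the hotel at no cost
total = plan.confirm_all()
```

The point of the design is that the agent can only call `propose`, and the step that actually costs money (`confirm_all`) sits behind a human review. A mistake caught at review time is free to undo; one caught after confirmation is back to the real-world cancellation problem.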