I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. vercel's agent-browser, the relatively new dev-browser[1], etc.)
There are usecases where the vision agent is the more obvious, or only choice though, e.g. prorprietary/locked-down desktop apps that lack an automation layer.
Interesting! I'll play around with agent-browser and update this article if anything comes up