I'm building something that fixes this exact problem[1].
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through a CLI: `invoke chrome pinTab`
Why accessibility? Well, it turns out the accessibility tree is just a good DOM in general: it's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
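For a concrete sense of what that structure looks like, here's a minimal sketch of reading the macOS accessibility tree from Python via PyObjC. This is generic Apple AX API usage, not the product's actual tooling; the pid is a placeholder, and the script needs Accessibility permission granted in System Settings.

```python
from ApplicationServices import (
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
)

def ax_attr(element, name):
    # PyObjC returns (error_code, value); 0 means kAXErrorSuccess.
    err, value = AXUIElementCopyAttributeValue(element, name, None)
    return value if err == 0 else None

def dump_tree(element, depth=0, max_depth=3):
    # Print role/title for each node -- the "DOM for apps" the agent explores.
    role, title = ax_attr(element, "AXRole"), ax_attr(element, "AXTitle")
    print("  " * depth + f"{role}: {title!r}")
    if depth < max_depth:
        for child in ax_attr(element, "AXChildren") or []:
            dump_tree(child, depth + 1, max_depth)

app = AXUIElementCreateApplication(12345)  # placeholder pid, e.g. from pgrep
dump_tree(app)
```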
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
I’m missing the premise. For internal apps, why would you ever reach for computer use vs just having your agent whip up a CLI or MCP?
_of course_ computer use is worse. It is your last resort. Do not use it on state that lives in a DB that you own.
If anything I am impressed that it’s only 50x worse.
Is it possible to ask the vision agent to "map" the UI and expose it to another agent as a set of interfaces that resemble an API better? From what I understand, the vision agent currently has to both know that "next page" shows more results and know that it needs to get more results in the first place.
If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and another agent is then given that description, would that agent perform better than one that both explores the UI and tries to accomplish the given task at the same time?
With an example UI I made up, the description (API-like interface definition) could be something like:
Get all reviews:
To get all the reviews you need to go to each page and click "show full review" for every review summary on that page.
Go to each page:
Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).
So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up, if there's a test environment. Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
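To make the idea concrete, the explorer's output could be as simple as a machine-readable version of the description above. A hypothetical sketch (every name and selector here is invented):

```python
# A "UI map" the explorer agent emits once; the executor agent then treats
# it like an API spec instead of re-discovering the UI on every run.
ui_map = {
    "get_all_reviews": {
        "precondition": "Reviews tab open; page 1 is the default",
        "steps": [
            "on each page, click 'show full review' on every summary",
            "click 'next' until the 'next' button is no longer available",
        ],
        "selectors": {
            "next_button": "button[aria-label='next']",
            "expand_review": "a.show-full-review",
        },
    },
}
```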
Totally agree. I’ve been building an AI visual tool recently and experimented with both approaches. The latency and cost of generic "agentic" browser use are absolute dealbreakers for real-time consumer apps right now. Structured APIs (even just chained LLM calls with strict JSON schemas) are not only 40x cheaper, but more importantly, they are deterministic enough to actually build a stable product on top of. Computer use is an amazing demo, but structured APIs are what pay the server bills.
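"Strict JSON schemas" in practice can be as simple as validating each step's output before chaining the next call. A sketch using the `jsonschema` package (the schema itself is a made-up example):

```python
import json
from jsonschema import validate

REVIEW_SCHEMA = {
    "type": "object",
    "properties": {
        "rating": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string", "maxLength": 280},
    },
    "required": ["rating", "summary"],
    "additionalProperties": False,
}

def parse_llm_step(raw_output: str) -> dict:
    data = json.loads(raw_output)                  # fail fast on malformed JSON
    validate(instance=data, schema=REVIEW_SCHEMA)  # fail fast on wrong shape
    return data                                    # safe input for the next step
```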
Many people are working on that :-)
Apps written now will have MCP servers / AI compatibility when relevant.
The issue that still needs solving is how to make LLMs interact with everything we already have and use (efficiently, not with screenshot, read, screenshot, ...).
Most of the time that means reverse engineering, either the app itself or the APIs it uses.
From GitHub (not my projects):
https://github.com/SimoneAvogadro/android-reverse-engineerin... => reverse engineer android app APIs from APKs
https://github.com/HKUDS/CLI-Anything => convert open-source GUI apps to CLIs
https://github.com/kalil0321/reverse-api-engineer => API reverse engineering from traffic (claude skills)
My take on the same issue (very young project):
Also API reverse engineering from traffic captures, with a focus on mobile apps, safety & community MCP generation
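The traffic-capture approach is roughly this shape: a proxy addon that catalogs every JSON endpoint it sees. A minimal sketch as a mitmproxy addon, run with `mitmdump -s sniff.py` (the mitmproxy hook is real; the catalog format is invented):

```python
from mitmproxy import http

class ApiCatalog:
    def __init__(self):
        self.seen = {}

    def response(self, flow: http.HTTPFlow):
        # Record each (method, path) pair once, with a sample response body.
        key = (flow.request.method, flow.request.path)
        content_type = flow.response.headers.get("content-type", "")
        if key not in self.seen and "application/json" in content_type:
            self.seen[key] = flow.response.get_text()[:200]
            print(f"{flow.request.method} {flow.request.pretty_url}")

addons = [ApiCatalog()]
```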
I'm always skeptical of the whole "computer use" concept. It's like hiring someone, inviting him into your house, and telling him to go ahead: feel free to sleep in the bed, use the toilet, eat whatever is in the fridge, watch the TV, and oh, here's the combination to the safe... and that someone you hired is a monkey.
I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents
Try playing fruit ninja via text and llm toolcalls though
And structured APIs are about 1e9x more expensive than not invoking an LLM in the first place and just using deterministic code to do the thing ... it's not like any of this is rational on a compute basis.
Wall clock time tells me everything I need to know. The vision model took almost 20 minutes to do the thing that Sonnet did in 20 seconds.
The only reason you wouldn’t choose an API is if it wasn’t viable.
Metadata and structure beats AI every time.
Computer Use? Or Browser Use? IMHO big diff
The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time: remember Prism [1]? I would just run that, get all the API calls in a nice format, and then replay them over and over to do things in succession.
In the new world, we have access to OpenAPI.json and whatnot, but for things built in the pre-OpenAPI, pre-specs, pre-best-practices days... I am not so sure! (and a lot of the world still lives there)
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
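The capture-and-replay idea needs nothing exotic once a call has been recorded; it's plain HTTP. A sketch with `requests` (URL, token, and payload are all placeholders):

```python
import requests

# One call as captured from a proxy session (all values are placeholders).
captured = {
    "method": "POST",
    "url": "https://legacy.example.com/api/orders",
    "headers": {"Authorization": "Bearer <token-from-capture>"},
    "json": {"sku": "A-123", "qty": 1},
}

# Replaying it over and over is just re-issuing the request.
resp = requests.request(
    captured["method"], captured["url"],
    headers=captured["headers"], json=captured["json"],
)
print(resp.status_code, resp.text[:200])
```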
All the websites currently blocking Claude Code or other AI agents are fighting a losing battle. Computer use is in the early stages, and the thing preventing mass adoption seems to be the number of tokens it takes. Agents can fumble around trying 10 CLI commands that don't work before finding the right one, and we barely notice. Visual agents (browser use / computer use etc.) also eventually fumble onto the right thing, but we don't have the patience to wait 20 minutes for a button click. As tokens get cheaper and faster, we'll probably get models that can use a UI just as natively as a CLI.
I think one main point is that not all "computer use" is the same; the harness and agentic experience matter a lot. A poorly designed API experience can actually be _less_ efficient than a well designed browser or computer use experience
In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)
At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and supply all (and only) the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier, which is why small models can do it; that's another dimension that must be considered
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be
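One way a hybrid approach keeps navigation token-efficient is to hand the model a flattened accessibility tree instead of pixels. A rough sketch using Playwright's accessibility snapshot (the API call is real Playwright; the flattened text format is invented for illustration, and this is not Smooth's actual implementation):

```python
from playwright.sync_api import sync_playwright

def flatten(node, depth=0, out=None):
    # One compact line per node: role plus accessible name.
    out = out if out is not None else []
    out.append("  " * depth + f"[{node['role']}] {node.get('name', '')}")
    for child in node.get("children", []):
        flatten(child, depth + 1, out)
    return out

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    snapshot = page.accessibility.snapshot()  # semantic tree, no pixels
    if snapshot:
        print("\n".join(flatten(snapshot)))
    browser.close()
```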
In an agentic world, the OS needs to be completely rethought. For example, every single app functionality should be exposable via an API while remaining human friendly.
I think OpenAI designing their own phone is the next logical step. I hope they succeed which should bring major competition to Apple and Android.
My "best practice" is to use as little "visual" (computer use) tooling and as much api + cli tooling as possible specifically to save on tokens.
Tokens a resource and should be managed as such.
Text-based web browsing? Would love the comparison there. Tons of systems have a DOM translation layer. I'm building around this, with the concept of turning a webpage into text for an agent to use directly. I actually had to move away from Haiku not because of accuracy problems but because it operated the browser too fast for a human to follow what it was doing. The real loss here is bespoke webapps like a Figma or Google Docs, where it's near impossible to see what they are doing via the DOM.
To me the browser is a translation layer. Working on the browser directly, while hard, enables big advantages on compatibility. The only thing I'm missing as of now (it's on the todo list) is OCR of the images in the browser into text. But an API would need to do that anyway to work.
The main loss, in my view, of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI; that's it. Computer use, to me, is the promise of being able to replicate end-to-end the actions a human does. An API can do that in theory, but the data to do it is also near impossible to collect properly.
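A minimal version of that DOM translation layer: strip the page to its text plus a numbered list of interactive elements, so a text-only agent can reply with something like "click 1". Sketched with BeautifulSoup (the numbering scheme is just an illustration):

```python
from bs4 import BeautifulSoup

def page_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    lines = [soup.get_text(" ", strip=True)[:500]]  # truncated page text
    # Number every interactive element so the agent can reference it.
    for i, el in enumerate(soup.select("a, button, input, select")):
        label = el.get_text(strip=True) or el.get("aria-label", "")
        lines.append(f"[{i}] <{el.name}> {label}")
    return "\n".join(lines)

html = (
    "<html><body><h1>Reviews</h1>"
    "<button>show full review</button><a href='/p2'>next</a>"
    "</body></html>"
)
print(page_to_text(html))
```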
Vision has a long way to go. I remember trying an early version of AWS's Nova Act and laughed at how slow it was. And a few months later it hadn't really seemed to improve that much.
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach.
A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
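Those artificial waits typically look like this in Playwright (`wait_for_selector` is real Playwright API; the URL and selector are placeholders):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://app.example.com/reviews")
    # Don't let the agent act until the content has actually rendered.
    page.wait_for_selector("div.review-row", state="visible", timeout=10_000)
    rows = page.query_selector_all("div.review-row")
    print(len(rows), "reviews visible")
```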
Sounds like some efficiency gains will still arrive.
What I don't understand about "computer use" is why they're not just grabbing the window handles and storing them, to determine what should be clicked after the first few iterations of using a specific application. If a new case / path / whatever is found, drop back to screen grabbing and bounding boxes, then figure out which handles are there and store them. A sketch of the idea is below.
idk.. not really thought out too much, but it has to be better
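That caching idea in miniature: resolve a UI target once via the expensive vision path, then replay the stored handle on every later run. Everything in this sketch is hypothetical:

```python
ui_cache: dict[tuple[str, str], str] = {}  # (app, action) -> stored handle

def resolve_by_vision(app: str, action: str) -> str:
    # Placeholder for the slow screenshot + bounding-box pipeline.
    print(f"vision pass for {app}/{action}")
    return "button#pin-tab"  # pretend this is the discovered handle

def locate(app: str, action: str) -> str:
    key = (app, action)
    if key not in ui_cache:              # new case/path: fall back to vision
        ui_cache[key] = resolve_by_vision(app, action)
    return ui_cache[key]                 # cheap replay on every later call

locate("chrome", "pin_tab")  # triggers the vision pass
locate("chrome", "pin_tab")  # cache hit, no vision needed
```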
This tracks; it has been my experience exactly. Not to mention there isn't a particularly significant lift in accuracy or speed. As things stand, to me it is the worst of both worlds: expensive and inaccurate.
> This is not a model problem. The vision agent was reasoning about a rendered page and had no signal that the page wasn't showing everything.
> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume a multiple of the tokens. Could you come up with an alternative here?
Do you know what the vision model was trained on? Because often people see “vision model” and think “human-level GUI navigator” when afaik the latter has yet to be built.
Isn't it possible to somehow wire this into the window manager? Wayland or whatever. Have it speak the native window language rather than crunch the pixels? At least for the majority.
I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency
It's funny watching the slow mean reversion back to more deterministic tooling.
by design: https://en.wikipedia.org/wiki/Desire_path
IMO, this is the argument for doing work in the first place.
It would be great if institutions like banks provided proper APIs.
I'm pretty unsurprised that the vision agent did worse. I'd be interested in a comparison between the different tools that now exist to let LLMs drive browsers (e.g. Vercel's agent-browser, the relatively new dev-browser[1], etc.)
There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.
I think of computer use as like last mile delivery. APIs and bash and such are the efficient logistics networks. Both have different benefits. Obviously, use the efficient methods when you can.
I have only found Computer Use useful for GUI app local debugging. Presumably it will also be useful for getting around protections for external apps that don't want AI to interact with them, or for interfacing with legacy apps or those built without AI in mind.
I don't think any new app should ever be specifically designed for AI to interact with it through computer use
The best use cases I've seen for computer/browser use is for legacy SaaS/Software. For example, hotels use archaic Property Management Systems (PMS) and they're required by corporate to use it and pay for it. These companies can barely keep the product alive, they definitely aren't incentivized to maintain an API. In such a case browser use agent seems to be the best (only) way.
Just wondering: RPA companies like UiPath are dead in the water, right?
The hard part about the web is that APIs often just aren't available, even if the website owner wants them exposed (big if).
I embedded a Google Calendar widget on my Book a demo page, I don't know the API and Google doesn't expose/maintain one either.
What we are doing at Retriever AI is to instead reverse engineer the website's APIs on the fly and call them directly from within the webpage so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
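The in-page trick is roughly this: run fetch() in the page's own JavaScript context so the session's cookies apply automatically. A generic sketch with Playwright's `page.evaluate` (not Retriever AI's actual implementation; the endpoint is invented and the login step is elided):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    page = p.chromium.launch().new_page()
    page.goto("https://app.example.com")  # assume the session is logged in
    # fetch() runs inside the page, so cookies/session tokens ride along free.
    data = page.evaluate(
        """async () => {
            const resp = await fetch('/api/v2/bookings', {
                credentials: 'same-origin',
                headers: {'Accept': 'application/json'},
            });
            return resp.json();
        }"""
    )
    print(data)
```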
I've been thinking of things I'd want an agent for recently. The problem is, everything I think of is something that requires using quite a few different websites, saving a lot of data securely, and working with a lot of sensitive accounts (my email, etc.)
The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:
- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.
- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.
- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.
Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.
Confusing title? "Computer Use" is actually "Browser vision"?
It doesn't matter.
Electron uses 10x more RAM than regular apps. But it's so convenient.
Python is 100x slower than C. It's in the top 3 of languages now.
Worse but more convenient always wins.
This is obvious. The problem is that not everything has an API, while everything has a human-oriented UI.
We need a superset of HTML that is designed for agents. I'm not sure it's quite as simple as "just make everything an API."
I find this extremely surprising.
When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
I have a similar finding for a website I made that collates college-town bar specials and live music. Using agents with vision models works, but it's not as straightforward as one would initially think. You can check out the results here: https://www.nittanynights.com
UX feedback
Me: hmm, this title confuses and infuriates Rob.
[Clicks link]
Me: Sees same title, repeat feelings of confusion and infuriation
[Scrolls article down on my smartphone]
Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.
[Closes tab]
[Continues living rest of my life]
I hope this feedback is well received and understood.
Browser agents / vision agents are a menace and ISPs should outright ban subscribers who run them on the public internet.
Only 45x?
The interface designed for humans is poor for AI needs? And the interface designed for programmatic use is easier for the AI to use? In other news, the sky is blue and water is wet.
This is missing the point that AI training probably cost boatloads more to get here.
For now.
Great guidance hidden in here for making it expensive for agents to navigate your website. Move elements on screen as the mouse moves, force natural mouse movement to make the UI work, change the button labels in the JS to be randomly named every visit, force scrolling to the bottom of the screen to check for hidden extra tasks...
Hang on, that sounds like common corporate SaaS apps.