"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."
They show an example of 5.4 clicking around in Gmail to send an email.
I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.
It feels like building humanoid robots so they can use tools built for human hands. Not clear if it will pay off, but if it does then you get a bunch of flexibility across any task "for free".
Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.
Lots of services have no desire to ever expose an API. This approach lets you step right over that.
If an API is exposed you can just have the LLM write something against that.
A model that gets good at computer use can be plugged in anywhere you have a human. A model that gets good at API use cannot. From the standpoint of diffusion into the economy/labor market, computer use is much higher value.
Same reason why Wikipedia deals with so many people scraping its web page instead of using their API:
Optimizations are secondary to convenience
I think the desire is that in the long-term AI should be able to use any human-made application to accomplish equivalent tasks. This email demo is proof that this capability is a high priority.
This opens up a new question: how does bot detection work when the bot is using the computer via a gui?
A world where AIs use APIs instead of UIs to do everything is a world where us humans will soon be helpless, as we'll have to ask the AIs to do everything for us and will have limited ability to observe and understand their work. I prefer that the AIs continue to use human-accessible tools, even if that's less efficient for them. As the price of intelligence trends toward zero, efficiency becomes relatively less important.
APIs have never been a gift but rather have always been a take-away that lets you do less than you can with the web interface. It’s always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.
But people are intimidated by the complexity of writing web crawlers because management has been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers…. Until LLMs came along, and changed the perceived economics and created a permission structure. [1]
AI is a threat to the “enshittification economy” because it lets us route around it.
[1] that high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site changing anything substantial about it is likely to unrecoverably tank their Google rankings so they won’t. A.I. might change the mechanics of that now that you Google traffic is likely to go to zero no matter what you do.
Lowest common denominator.
Because the web and software more generally if full of not APIs and you do, in fact, need the clicking to work to make agents work generally
The ideal of REST, the HTML and UI is the API.
I guess a big chunk of their target market won't know how to use APIs.
not everything has an API, or API use is limited. some UIs are more feature complete than their APIs
some sites try to block programmatic use
UI use can be recorded and audited by a non-technical person
One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?
The vast majority of websites you visit don’t have usable APIs and very poor discovery of the those APIs.
Screenshots on the other hand are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consumer compared to all the back and forth verbose json payloads of APIs