"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through c...

mattas • yesterday at 6:15 PM • 16 replies

"GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event."

They show an example of 5.4 clicking around in Gmail to send an email.

I still think this is the wrong interface to be interacting with the internet. Why not use Gmail APIs? No need to do any screenshot interpretation or coordinate-based clicking.

Replies

bottlepalm • yesterday at 9:02 PM

The vast majority of websites you visit don’t have usable APIs, and discovery of those APIs is very poor.

Screenshots, on the other hand, are documentation, API, and discovery all in one. And you’d be surprised how little context/tokens screenshots consume compared to all the back-and-forth of verbose JSON API payloads.

npilk • yesterday at 6:58 PM

It feels like building humanoid robots so they can use tools built for human hands. Not clear if it will pay off, but if it does then you get a bunch of flexibility across any task "for free".

Of course APIs and CLIs also exist, but they don't necessarily have feature parity, so more development would be needed. Maybe that's the future though since code generation is so good - use AI to build scaffolding for agent interaction into every product.

f0e4c2f7 • yesterday at 7:09 PM

Lots of services have no desire to ever expose an API. This approach lets you step right over that.

If an API is exposed you can just have the LLM write something against that.

coffeemug • yesterday at 7:02 PM

A model that gets good at computer use can be plugged in anywhere you have a human. A model that gets good at API use cannot. From the standpoint of diffusion into the economy/labor market, computer use is much higher value.

MattDaEskimo • yesterday at 8:26 PM

Same reason why Wikipedia deals with so many people scraping its web pages instead of using its API:

Optimizations are secondary to convenience

TheAceOfHearts • yesterday at 6:20 PM

I think the desire is that in the long-term AI should be able to use any human-made application to accomplish equivalent tasks. This email demo is proof that this capability is a high priority.

kristianp • yesterday at 8:20 PM

This opens up a new question: how does bot detection work when the bot is using the computer via a gui?

modeless • yesterday at 6:46 PM

A world where AIs use APIs instead of UIs to do everything is a world where us humans will soon be helpless, as we'll have to ask the AIs to do everything for us and will have limited ability to observe and understand their work. I prefer that the AIs continue to use human-accessible tools, even if that's less efficient for them. As the price of intelligence trends toward zero, efficiency becomes relatively less important.

PaulHoule • yesterday at 6:25 PM

APIs have never been a gift but rather have always been a take-away that lets you do less than you can with the web interface. It’s always been about drinking through a straw, paying NASA prices, and being limited in everything you can do.

But people were intimidated by the complexity of writing web crawlers, because management had been so traumatized by the cost of making GUI applications that they couldn’t believe how cheap it is to write crawlers and scrapers… until LLMs came along, changed the perceived economics, and created a permission structure. [1]

AI is a threat to the “enshittification economy” because it lets us route around it.

[1] That high cost of GUI development is one reason why scrapers are cheap… there is a good chance that the scraper you wrote 8 years ago still works because (a) they can’t afford to change their site and (b) if they could afford to change their site, changing anything substantial about it is likely to irrecoverably tank their Google rankings, so they won’t. A.I. might change the mechanics of that, now that your Google traffic is likely to go to zero no matter what you do.

time0ut • yesterday at 11:13 PM

Lowest common denominator.

jstummbillig • yesterday at 6:42 PM

Because the web, and software more generally, is full of non-APIs, and you do, in fact, need the clicking to work to make agents work generally.

satvikpendem • yesterday at 6:24 PM

In the ideal of REST, the HTML and UI are the API.

Jacques2Marais • yesterday at 6:22 PM

I guess a big chunk of their target market won't know how to use APIs.

spongebobstoes • yesterday at 6:22 PM

not everything has an API, or API use is limited. some UIs are more feature complete than their APIs

some sites try to block programmatic use

UI use can be recorded and audited by a non-technical person

steve1977 • yesterday at 6:37 PM

One could argue that LLMs learning programming languages made for humans (i.e. most of them) is using the wrong interface as well. Why not use machine code?
