Click coordinates. Agentic GUI work is really annoying when the multi-modal agent can't click on x,y coordinates.
I tested Qwen3.6, Gemma4, and Nemotron3-nano-omni; they fully hallucinate x,y coords (haven't tried GLM-5V yet).
GPT-5.5 can do it easily. But Vocaela, a tiny 500M model, is also quite good at it. Hope the smallish multi-modals get better training for x,y clicking soon.
Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http
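Not the code from that repo, just a minimal sketch of what a local "click at x,y" HTTP service can look like. Flask and pyautogui are my placeholder choices here, not necessarily what the linked project uses:

```python
# Minimal sketch: an HTTP endpoint that takes pixel coordinates and clicks there.
# Dependencies (assumed, not from the linked repo): flask, pyautogui.
from flask import Flask, request, jsonify
import pyautogui

app = Flask(__name__)

@app.post("/click")
def click():
    data = request.get_json(force=True)
    x, y = int(data["x"]), int(data["y"])
    pyautogui.click(x, y)  # move the OS cursor to (x, y) and left-click
    return jsonify({"clicked": [x, y]})

if __name__ == "__main__":
    app.run(port=8765)
```

The agent then only needs to POST `{"x": 960, "y": 540}` to `/click` instead of driving the browser directly.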
Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I'd hope Qwen3.6 didn't lose this capability.
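To make the 0..1000 normalization concrete, here's a tiny helper (my own sketch, not from Qwen's docs) that scales the normalized model output back to pixel coordinates of the screenshot:

```python
# Scale a coordinate normalized to 0..1000 back to the screenshot's pixel grid.
def to_pixels(norm_x: int, norm_y: int, width: int, height: int) -> tuple[int, int]:
    return round(norm_x / 1000 * width), round(norm_y / 1000 * height)

# e.g. a 1920x1080 screenshot and model output (500, 500) -> screen center
print(to_pixels(500, 500, 1920, 1080))  # (960, 540)
```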
This sounds a lot like another Hacker News post from the last few days. It's the same problem image generators have with a prompt like "produce the numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into a raster/vector split, where the model first produces the visual content and then an SVG overlay, it's completely capable.
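To make that split concrete: the vector half doesn't even need the model, since the spiral layout can be computed in plain code and handed over as an overlay. A small sketch (numbers and sizes are arbitrary):

```python
# Lay out the numbers 1-50 on a spiral and write them as an SVG overlay.
import math

parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="400" height="400">']
cx, cy = 200, 200                # center of the canvas
for n in range(1, 51):
    angle = n * 0.5              # radians between consecutive numbers
    radius = 3.5 * n             # grow outward so labels don't overlap
    x = cx + radius * math.cos(angle)
    y = cy + radius * math.sin(angle)
    parts.append(f'<text x="{x:.0f}" y="{y:.0f}" font-size="10">{n}</text>')
parts.append("</svg>")

with open("spiral.svg", "w") as f:
    f.write("\n".join(parts))
```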
Have you tried doing a two-step: review the image, then render a vector?
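For click coordinates, something like that might look like the sketch below. It assumes a local OpenAI-compatible endpoint (e.g. a vLLM or llama.cpp server); the endpoint URL, model name, and prompts are all placeholders, not anything from the posts above:

```python
# Two-step sketch: first ask the vision model to describe the UI, then ask for
# coordinates of a specific element, feeding its own description back in.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(prompt: str, image_b64: str) -> str:
    resp = client.chat.completions.create(
        model="local-vlm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

# Step 1: review the image, no coordinates yet.
elements = ask("List the clickable UI elements you see, one per line.", img)

# Step 2: ask for coordinates of one element, given the description from step 1.
coords = ask(f"Here are the elements you found:\n{elements}\n"
             "Return the x,y pixel coordinates of the 'Submit' button as JSON.", img)
print(coords)
```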
I've had lots of success with generating coordinates and answering questions using the UI-TARS model https://github.com/bytedance/UI-TARS.