Click coordinates. Agentic GUI work is really annoying when the multi-modal agent can't click on x,y coordinates.
I tested Qwen3.6, Gemma4, and Nemotron3-nano-omni; they fully hallucinate x,y coords (haven't tried GLM-5V yet).
GPT-5.5 can do it easily. But Vocaela, a tiny 500M model, is also quite good at it. Hope the smallish multi-modals get better training for x,y clicking soon.
Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http
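Not the code from that repo, just a minimal sketch of what a local "click at x,y" HTTP service can look like. Flask and pyautogui are my placeholder choices here, not necessarily what the linked project uses:

```python
# Minimal sketch: an HTTP endpoint that takes pixel coordinates and clicks there.
# Dependencies (assumed, not from the linked repo): flask, pyautogui.
from flask import Flask, request, jsonify
import pyautogui

app = Flask(__name__)

@app.post("/click")
def click():
    data = request.get_json(force=True)
    x, y = int(data["x"]), int(data["y"])
    pyautogui.click(x, y)  # move the OS cursor to (x, y) and left-click
    return jsonify({"clicked": [x, y]})

if __name__ == "__main__":
    app.run(port=8765)
```

The agent then only needs to POST `{"x": 960, "y": 540}` to `/click` instead of driving the browser directly.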
Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I'd hope Qwen3.6 didn't lose this capability.
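To make the 0..1000 normalization concrete, here's a tiny helper (my own sketch, not from Qwen's docs) that scales the normalized model output back to pixel coordinates of the screenshot:

```python
# Scale a coordinate normalized to 0..1000 back to the screenshot's pixel grid.
def to_pixels(norm_x: int, norm_y: int, width: int, height: int) -> tuple[int, int]:
    return round(norm_x / 1000 * width), round(norm_y / 1000 * height)

# e.g. a 1920x1080 screenshot and model output (500, 500) -> screen center
print(to_pixels(500, 500, 1920, 1080))  # (960, 540)
```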
This sounds a lot like another Hacker News post from the last few days. It's the same problem image generators have with a prompt like "produce the numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into a raster/vector split, where the model first produces the visual content and then an SVG overlay, it's completely capable.
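To make that split concrete: the vector half doesn't even need the model, since the spiral layout can be computed in plain code and handed over as an overlay. A small sketch (numbers and sizes are arbitrary):

```python
# Lay out the numbers 1-50 on a spiral and write them as an SVG overlay.
import math

parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="400" height="400">']
cx, cy = 200, 200                # center of the canvas
for n in range(1, 51):
    angle = n * 0.5              # radians between consecutive numbers
    radius = 3.5 * n             # grow outward so labels don't overlap
    x = cx + radius * math.cos(angle)
    y = cy + radius * math.sin(angle)
    parts.append(f'<text x="{x:.0f}" y="{y:.0f}" font-size="10">{n}</text>')
parts.append("</svg>")

with open("spiral.svg", "w") as f:
    f.write("\n".join(parts))
```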
Have you tried doing a two-step: review the image, then render a vector?
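For click coordinates, something like that might look like the sketch below. It assumes a local OpenAI-compatible endpoint (e.g. a vLLM or llama.cpp server); the endpoint URL, model name, and prompts are all placeholders, not anything from the posts above:

```python
# Two-step sketch: first ask the vision model to describe the UI, then ask for
# coordinates of a specific element, feeding its own description back in.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(prompt: str, image_b64: str) -> str:
    resp = client.chat.completions.create(
        model="local-vlm",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

# Step 1: review the image, no coordinates yet.
elements = ask("List the clickable UI elements you see, one per line.", img)

# Step 2: ask for coordinates of one element, given the description from step 1.
coords = ask(f"Here are the elements you found:\n{elements}\n"
             "Return the x,y pixel coordinates of the 'Submit' button as JSON.", img)
print(coords)
```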
I've had lots of success with generating coordinates and answering questions using the UI-TARS model https://github.com/bytedance/UI-TARS.