Hacker News

badmonster · 04/24/2025 · 1 reply

Congrats on the launch! Love this idea. How does the LLM interact with the VM - screen+metadata as JSON, or higher-level planning?


Replies

frabonacci · 04/24/2025

Thanks, really appreciate it!

The LLM interacts with the VM through a structured virtual computer interface (cua-computer and cua-agent). It’s a high-level abstraction that lets the agent act (e.g., “open Terminal”, “type a command”, “focus an app”) and observe (e.g., current window, file system, OCR of the screen, active processes) in a way that feels a lot more like using a real computer than parsing raw data.
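
Here's a rough sketch of what that act/observe contract looks like in code. This is hedged shorthand, not copied verbatim: the import, constructor parameters, and method names are assumptions, so check the repo for the real API:

    # Hedged sketch of the cua-computer interface; the package import,
    # constructor arguments, and method names are assumptions, not verbatim.
    import asyncio
    from computer import Computer  # assumed import from cua-computer

    async def main():
        computer = Computer(os="macos")  # sandboxed VM; parameters assumed
        await computer.run()
        try:
            # Act: high-level input primitives instead of raw events.
            await computer.interface.type_text("echo hello from the agent")
            await computer.interface.press_key("enter")

            # Observe: structured state the agent can reason over.
            png_bytes = await computer.interface.screenshot()
            print(f"captured {len(png_bytes)} bytes of screen")
        finally:
            await computer.stop()

    asyncio.run(main())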

So under the hood, yes, screen+metadata are used (especially with the Omni loop and visual grounding), but what the model sees is a clean interface designed for agentic workflows - closer to how a human would think about using a computer.
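
If it helps to picture it, the observation the model reasons over is conceptually a structured payload alongside the screenshot, something like the toy example below. This schema is invented for illustration, not the actual wire format:

    # Hypothetical observation/action shapes, invented for illustration;
    # the real payload format lives in the cua-computer/cua-agent code.
    observation = {
        "screenshot_b64": "<base64 PNG of the current screen>",
        "active_window": {"app": "Terminal", "title": "zsh"},
        "ocr_text": ["$ echo hello from the agent", "hello from the agent"],
        "processes": ["Finder", "Terminal"],
    }

    # The agent emits a high-level action rather than raw input events.
    action = {"type": "type_text", "text": "ls -la\n"}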

If you're curious, the agent loops (OpenAI, Anthropic, Omni, UI-Tars) offer different ways of reasoning and grounding actions, depending on whether you're using cloud or local models.

https://github.com/trycua/cua/tree/main/libs/agent#agent-loo...
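
Picking a loop is basically a constructor argument. Again a hedged sketch: the class and enum names below follow my reading of the docs and may have drifted, and the model name is just an example of a local option:

    # Hedged sketch of selecting an agent loop in cua-agent; class and enum
    # names are assumptions from the docs, not a guaranteed-current API.
    import asyncio
    from computer import Computer
    from agent import ComputerAgent, AgentLoop, LLM, LLMProvider

    async def main():
        computer = Computer(os="macos")
        await computer.run()
        agent = ComputerAgent(
            computer=computer,
            loop=AgentLoop.OMNI,  # or OPENAI / ANTHROPIC / UITARS
            model=LLM(provider=LLMProvider.OLLAMA, name="llama3.2"),  # example local model
        )
        # The loop plans, grounds actions on the screen, and executes in the VM.
        async for result in agent.run("Open Terminal and run 'ls'"):
            print(result)

    asyncio.run(main())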
