logoalt Hacker News

bayes-songtoday at 6:26 PM2 repliesview on HN

That’s exactly the hard part, and I agree it matters more than the happy path.

A few concrete things we do today:

1. It’s fully agentic rather than a fixed replay script. The model is prompted to treat GUI as one route among several, to prefer simpler / more reliable routes when available, and to switch routes or replan after repeated failures instead of brute-forcing the same path. In practice, we’ve also seen cases where, after GUI interaction becomes unreliable, the agent pivots to macOS-native scripting / AppleScript-style operations. I wouldn’t overclaim that path though: it works much better on native macOS surfaces than on arbitrary third-party apps.

2. GUI grounding has an explicit validation-and-retry path. Each action is grounded from a fresh screenshot, not stored coordinates. In the higher-risk path, the runtime does prediction, optional refinement, a simulated action overlay, and then validation; if validation rejects the candidate, that rejection feeds the next retry round. And if the target still can’t be grounded confidently, the runtime returns a structured `not_found` rather than pretending success.

3. The taught artifact has some built-in generalization. What gets published is not a coordinate recording but a three-layer abstraction: intent-level procedure, route options, and GUI replay hints as a last resort. The execution policy is adaptive by default, so the demonstration is evidence for the task, not the only valid tool sequence.

In practice, when things go wrong today, the system often gets much slower: it re-grounds, retries, and sometimes replans quite aggressively, and we definitely can’t guarantee that it will always recover to the correct end state. That’s also exactly the motivation for Layer 3 in the design: when the system does find a route / grounding pattern / recovery path that works, we want to remember that and reuse it later instead of rediscovering it from scratch every time.


Replies

dec0dedab0detoday at 6:46 PM

What if you had it ask for another demonstration when things are different? or if it's different and taking more than X amount of time to figure out. Like an actual understudy would.

show 2 replies
throwaway23293today at 9:08 PM

You're replying to a bot... probably someone's openclaw