
edunteman · yesterday at 11:16 PM

Thanks for sharing your experience! I'd love to chat about what you did to make this work and, if I may, use it to inform the design of this system. I'm at erik [at] pig.dev

To clarify, the use of CLIP embeddings in the CUA example is an implementation decision specific to that example, not core to the engine itself.

This was very intentional in the design of Check as a pair of Capture() -> T and Compare(current: T, candidate: T) -> bool. T can be any data type that serializes to a DB, and the comparison is user-defined to operate on that generic type T.
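
Roughly, the shape looks like this (a simplified Python sketch, not the exact API; the OCR-based example T is purely illustrative):

    from dataclasses import dataclass
    from typing import Callable, Generic, TypeVar

    T = TypeVar("T")

    @dataclass
    class Check(Generic[T]):
        capture: Callable[[], T]         # snapshot the environment as some serializable T
        compare: Callable[[T, T], bool]  # is the cached candidate still valid vs. current?

    # Example T: a bag of OCR'ed strings instead of (or alongside) CLIP embeddings.
    def capture_ocr() -> set[str]:
        # placeholder: run OCR over a screenshot and return the recognized strings
        return {"Submit", "Cancel"}

    def compare_ocr(current: set[str], candidate: set[str]) -> bool:
        # valid only if every string the cached trajectory relied on is still visible
        return candidate <= current

    ocr_check = Check(capture=capture_ocr, compare=compare_ocr)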

A more complete CUA example would store features like OCR'ed text, Accessibility Tree data, etc.

I'll also take this chance to call out a few outstanding questions that I don't yet have answers for:

- Parameterization. Rather than caching and reusing strict coordinates, what happens when the arguments of a tool call are derived from the top-level prompt, or, even more challenging, from the result of a previous tool call? In the case of computer use, perhaps a very specific element XPath is needed, but that element is not "compile-time known"; rather, it's derived mid-trajectory.

- What would it look like to stack compare filters? i.e., if a user wanted to first filter by cosine distance and then apply stricter checks on OCR contents (a rough sketch of this is below the list).

- As you mentioned, how can you store some knowledge of environment features where change *is* expected? A datetime in the bottom right is the perfect example of this.
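
To make the last two concrete, here's one possible shape for stacked compare filters plus an "expected change" mask. Again, just an illustrative sketch: the Snapshot fields, threshold, and mask are made up for the example:

    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        embedding: list[float]   # e.g. a CLIP embedding of the screen
        ocr_text: set[str]       # text recovered via OCR
        ignore: set[str]         # features where change *is* expected (e.g. a clock)

    def cosine_close(current: Snapshot, candidate: Snapshot, threshold: float = 0.9) -> bool:
        # cheap first-pass filter on embedding similarity
        dot = sum(a * b for a, b in zip(current.embedding, candidate.embedding))
        norm = (sum(a * a for a in current.embedding) ** 0.5) * \
               (sum(b * b for b in candidate.embedding) ** 0.5)
        return norm > 0 and dot / norm >= threshold

    def ocr_matches(current: Snapshot, candidate: Snapshot) -> bool:
        # stricter second pass: all cached text must still be present,
        # except features explicitly marked as expected to change
        required = candidate.ocr_text - candidate.ignore
        return required <= current.ocr_text

    FILTERS = (cosine_close, ocr_matches)

    def compare_stacked(current: Snapshot, candidate: Snapshot) -> bool:
        # run filters in order, cheapest first, bailing out on the first failure
        return all(f(current, candidate) for f in FILTERS)

    cached = Snapshot(embedding=[0.1, 0.9], ocr_text={"Submit", "12:04 PM"}, ignore={"12:04 PM"})
    live = Snapshot(embedding=[0.12, 0.88], ocr_text={"Submit", "12:07 PM"}, ignore=set())
    assert compare_stacked(live, cached)  # the clock changed, but that change was expected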