Not sure if this can work. We played around with something similar too for computer use, but comparing embeddings to cache validate the starting position is super gray, no clear threshold. For example, the datetime on the bottom right changes. Or if it's an app with a database etc, it can change the embeddings arbitrarily. Also, you must do this in every step, because as you said, things might break at any point. I just don't see how you can reliably validate. If anything, if models are cheap, you could use another cheaper llm call to compare screenshots, or adjust the playwright/api script on the fly. We ended up writing up a quite different approach that worked surprisingly well.
There are definitely a lot of potential solutions, I'm curious where this goes. IMO an embeddings approach won't be enough. I'm more than happy to discuss what we did internally to achieve a decent rate though, the space is super promising for sure.
Hey, working on a personal project. Would love to dig into how you approached this.
Thanks for sharing your experience! I'd love to chat about what you did to make this work, if I may use it to inform the design of this system. I'm at erik [at] pig.dev
To clarify, the use of CLIP embeddings in the CUA example is an implementation decision for the CUA example, not core to the engine itself.
This was very intentional in the design of Check being a pair of Capture() -> T and Compare(current: T, candidate: T) -> bool. T can be any data type that can serialize to a DB, and the comparison is user-defined to operate on that generic type T.
A more complete CUA example would store features like OCR'ed text, Accessibility Tree data, etc.
I'll use now to call out a few outstanding questions that I don't yet have answers for:
- Parameterization. Rather than caching and reusing strict coordinates, what happens when the arguments of a tool call are derived from the top level prompt, or even more challenging, as the result of a previous tool call. In the case of computer use, perhaps a very specific element x-path is needed, but that element is not "compile time known", rather derived mid-trajectory.
- What would it look like to stack compare filters? IE, if a user wanted to first filter by cosine distance, and then later apply more strict checks on OCR contents.
- As you mentioned, how can you store some knowledge of environment features where change *is* expected. Datetime in the bottom right is the perfect example of this.