Speaking of browser automation, are there any LLMs or tools that hook up to actual desktop browsers and can automate the keyboard and mouse?
Which LLMs best drive these? Claude/Gemini, etc., or is anything local actually competent at it?
Can they understand layout and visual cues through a VLM or other multimodal input?
Are they robust enough to interact with three.js scenes, videos, and whatnot, or can they only blindly navigate the DOM?