logoalt Hacker News

eduntemanyesterday at 11:35 PM0 repliesview on HN

Our experience working with A11y apis like above is that data is frequently missing, and the APIs can be shockingly slow to read from. The highest performing agents in WindowsArena use a mixture of A11y and yolo-like grounding models such as Omniparser, with A11y seeming shifting out of vogue in favor of computer vision, due to it giving incomplete context.

Talking with users who just write their own RPA, they most loved APIs for doing so was consistently https://github.com/asweigart/pyautogui, which does offer A11y APIs but they're messy enough that many of the teams I talked to used the pyautogui.locateOnScreen('button.png') fuzzy image matching feature.