> However, I would like to point out that Apple isn't totally wrong here, because the accessibility API is unfortunately far too broadly scoped: you literally get access to everything on the computer. You can take screenshots, listen, and move the cursor... This is completely ridiculous. The proper engineering solution would be to phase out the accessibility API and replace it with something narrowly scoped, so you can grant specific permissions individually.
If you don't have use of your hands, you want that. The whole point of accessibility APIs is to allow arbitrary control of your computer via novel means. One of the big selling points of Dragon NaturallySpeaking is the ability to tell your computer to do things based on descriptions, without a mouse: "open outlook", "click compose", "select subject", "type foo", etc.
And no, the solution here is not computer vision with an LLM. Text and buttons rendered on my computer exist in memory somewhere as text and buttons. We should not need to convert them to pixels and then lossily back again to recover text and buttons. We should just expose things to the accessibility API and not guess.
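To make that concrete, here is a toy sketch of what "exposed as text and buttons" buys you. The `Node` class, the `find` helper, and the sample window are all made up for illustration and are not any real OS accessibility API; the point is only that a command like "click compose" becomes an exact lookup by role and name rather than a guess from pixels.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    """Hypothetical accessibility node: a role, a name, and child nodes."""
    role: str
    name: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, role: str, name: str) -> Optional[Node]:
    """Return the first node with the given role and case-insensitive name."""
    for node in walk(root):
        if node.role == role and node.name.lower() == name.lower():
            return node
    return None


# Made-up tree standing in for what an app would expose to the OS.
window = Node("window", "Outlook", [
    Node("toolbar", "Main", [
        Node("button", "Compose"),
        Node("button", "Reply"),
    ]),
    Node("textfield", "Subject"),
])

# "click compose" resolves to an exact element, with no OCR involved.
target = find(window, "button", "compose")
```

A real assistive tool would then ask the OS to activate `target`; that step is left out here because it depends on the platform.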
> And no, the solution here is not computer vision with an LLM.
Also, even if you hypothetically wanted to use computer vision with an LLM… what API is that LLM going to use to take screenshots and click on stuff?
> Chrome and anything Electron-based don't provide any accessibility information to the OS
Are we sure about this? At least on Windows, NVDA works fine with Chrome and Electron apps.