I have dystonia, which often stiffens my arms to the point where typing on a keyboard becomes impossible. Speech-to-text apps like SuperWhisper have proven very helpful in those situations, and I am hoping to get a similar experience out of "Handy" (very apt naming from my perspective).
I do, however, wonder if there is a way all these dictation tools can get to the next level. The generated text should not just be a verbatim copy of what I said; depending on the context, it should elaborate. For example, if my cursor is inside an editor/IDE with some code, my coding-related verbal prompts should generate the right/desired code in that IDE.
Perhaps this amounts to combining speech-to-text with computer-use.
What you're describing is possible by feeding the output of speech-to-text tools into an LLM. You can prompt the LLM to make sense of what you're trying to achieve and generate a set of actions. With a CLI it's trivial: your verbal command can be translated into working shell commands (see the sketch below). With a GUI it's slightly more complicated, because the LLM agent also needs to know what's on your screen.
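A minimal sketch of the CLI case, assuming a local OpenAI-compatible server (the endpoint URL, model name, and API key variable are placeholders, not anything a specific tool ships with):

```python
# Pipe dictated text through an OpenAI-compatible chat endpoint and turn it
# into a shell command. Endpoint, model, and key are assumptions -- point them
# at whatever local or hosted LLM you actually run.
import os
import subprocess
import requests

ENDPOINT = "http://localhost:11434/v1/chat/completions"  # e.g. a local Ollama/llama.cpp server
MODEL = "qwen2.5:7b"                                      # any model your server exposes

def dictation_to_command(transcript: str) -> str:
    """Ask the LLM to translate a spoken request into a single shell command."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {os.environ.get('LLM_API_KEY', 'none')}"},
        json={
            "model": MODEL,
            "messages": [
                {"role": "system",
                 "content": "Translate the user's spoken request into one POSIX shell "
                            "command. Reply with the command only, no explanation."},
                {"role": "user", "content": transcript},
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

if __name__ == "__main__":
    command = dictation_to_command("show me the ten largest files in my downloads folder")
    print(f"Proposed command: {command}")
    if input("Run it? [y/N] ").lower() == "y":   # always confirm before executing
        subprocess.run(command, shell=True)
```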
That CLI bit I mentioned earlier is already possible. For instance, on macOS there’s an app called MacWhisper that can send dictation output to an OpenAI‑compatible endpoint.
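For illustration, here's roughly what the receiving end of such an endpoint could look like. This is not MacWhisper's own behaviour, just a stand-in server that speaks the OpenAI chat-completions shape and rewrites the dictated text before it comes back:

```python
# Minimal OpenAI-compatible /v1/chat/completions endpoint that post-processes
# dictation. The rewrite() step is a placeholder you could swap for a real LLM
# call plus whatever context (active app, clipboard, etc.) you can gather.
import time
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)

def rewrite(text: str) -> str:
    # Placeholder "elaboration" step.
    return text.strip().capitalize()

@app.post("/v1/chat/completions")
def chat_completions():
    body = request.get_json(force=True)
    dictated = body["messages"][-1]["content"]  # assume the transcript is the last user message
    return jsonify({
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "dictation-rewriter"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": rewrite(dictated)},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(port=8000)
```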
I totally agree with you, and what you're describing is largely one of the reasons I made Handy open source. I really want to see something like this exist and see someone experiment with making it happen. I have heard of people playing with small local models (moondream, qwen) to get more context about the computer itself; a rough sketch of that idea is below.
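A rough sketch of what "context from the computer itself" could mean: grab a screenshot and ask a small local vision model what the user is looking at, so the dictation pipeline can adapt its prompt. Serving moondream through Ollama is just one assumption here; any local VLM with an HTTP API would do:

```python
# Ask a local vision model (moondream via Ollama) to describe the current screen.
import base64
import io
import requests
from PIL import ImageGrab  # pip install pillow

def screen_context(question: str = "What application is in the foreground and what is the user doing?") -> str:
    shot = ImageGrab.grab()                       # full-screen screenshot
    buf = io.BytesIO()
    shot.save(buf, format="PNG")
    img_b64 = base64.b64encode(buf.getvalue()).decode()

    resp = requests.post(
        "http://localhost:11434/api/generate",    # Ollama's generate endpoint
        json={
            "model": "moondream",
            "prompt": question,
            "images": [img_b64],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    print(screen_context())
```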
I initially had a ton of keyboard shortcuts in Handy for myself when I had a broken finger and was in a cast. That let me play with the simplest form of this contextual idea, since shortcuts could effectively be mapped to certain apps with very clear use cases.