I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).
Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.
What are people actually using in 2026 if they want open + local voice?
Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still "streaming ASR + LLM + streaming TTS" glued together? (Rough sketch of that cascade at the bottom of this post.)
If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?
What’s the most “works today” combo on a single GPU?
Bonus: rough numbers people actually see for mic → first-audio-back latency.
Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
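To make "glued together" concrete, here's the shape I mean, as a minimal non-streaming sketch in Python. Everything here is placeholder-level: faster-whisper for ASR, an OpenAI-compatible server assumed on localhost:8000 (llama.cpp / Ollama / vLLM all expose one), and the Piper CLI for TTS with a made-up voice path. It records fixed 5-second windows, so it's nowhere near the barge-in loop I actually want:

    # Minimal cascaded loop: mic -> Whisper -> local LLM -> Piper -> speaker.
    # Blocking and non-streaming; it only shows the shape of the pipeline.
    import subprocess
    import tempfile

    import requests
    import sounddevice as sd
    import soundfile as sf
    from faster_whisper import WhisperModel

    SAMPLE_RATE = 16000
    asr = WhisperModel("small.en", device="cuda", compute_type="float16")

    def record(seconds=5):
        # Record a fixed window from the default mic (no VAD/endpointing).
        audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        return audio.squeeze()

    def transcribe(audio):
        segments, _ = asr.transcribe(audio, language="en")
        return " ".join(s.text for s in segments).strip()

    def llm_reply(prompt):
        # Assumes any OpenAI-compatible server on :8000; most local servers
        # ignore or remap the model name.
        r = requests.post("http://localhost:8000/v1/chat/completions", json={
            "model": "local",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        })
        return r.json()["choices"][0]["message"]["content"]

    def speak(text):
        # Shell out to the Piper CLI; the voice/model path is a placeholder.
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            subprocess.run(["piper", "--model", "en_US-lessac-medium.onnx",
                            "--output_file", f.name],
                           input=text.encode(), check=True)
            audio, sr = sf.read(f.name)
            sd.play(audio, sr)
            sd.wait()

    while True:
        heard = transcribe(record())
        if heard:
            speak(llm_reply(heard))

Swapping each stage for a streaming equivalent (and adding VAD for endpointing/barge-in) is exactly the part I can't find a good writeup for.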
Tangential: what hardware are you using for the interface on these? Is there a good array microphone that performs on par with Echos/Google Homes/HomePods?
For the TTS part: https://github.com/supertone-inc/supertonic
I have used https://github.com/SaynaAI/sayna . What I like most is that you can switch between providers easily and see what works best for you. It also supports local models.
I haven't tried them myself, but Kyutai has a couple of projects that could fit.
It requires a bit of tinkering, but I think pipecat is the way to go. You can plug in pretty much any STT/LLM/TTS you want and go. It definitely supports local models, but it's up to you to get your hands on those models.
Not sure if there are any turnkey setups preconfigured for local install where you can just press play, though.
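To give you an idea of the shape, a local-only pipeline looks roughly like this. Big caveat: the import paths and service names are from my memory of an older pipecat release, so treat all of them as assumptions and check the current docs:

    # Rough shape only -- import paths/service names from an older pipecat
    # release; verify against current docs before trusting any of this.
    import asyncio

    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.pipeline.runner import PipelineRunner
    from pipecat.pipeline.task import PipelineTask
    from pipecat.services.ollama import OLLamaLLMService
    from pipecat.services.whisper import WhisperSTTService
    from pipecat.transports.base_transport import TransportParams
    from pipecat.transports.local.audio import LocalAudioTransport

    async def main():
        transport = LocalAudioTransport(
            TransportParams(audio_in_enabled=True, audio_out_enabled=True))
        stt = WhisperSTTService()                    # local faster-whisper
        llm = OLLamaLLMService(model="llama3.1:8b")  # any Ollama model
        # Real setups also put a context aggregator between stt and llm,
        # and a locally-running TTS service between llm and audio out.
        pipeline = Pipeline([
            transport.input(),   # mic frames in
            stt,                 # audio -> text frames
            llm,                 # text -> streamed completion frames
            # tts,               # text -> audio frames (your local pick)
            transport.output(),  # speaker out
        ])
        await PipelineRunner().run(PipelineTask(pipeline))

    asyncio.run(main())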
Last I heard, E2E speech-to-speech models are still pretty weak. I've had pretty bad results from gpt-realtime, and that's a proprietary model; I'm assuming open source is a bit behind.
https://handy.computer got good marks from a very nontechnical user in my life this week!
Local, FOSS
Anyone using any reasonably good small open-source speech-to-text models?
It was a little annoying getting the old Qt5 tools installed, but I really enjoyed using dsnote / Speech Note. Huge model selection for my AMD GPU. Good tool. I haven't studied the options closely enough yet to suggest which model to go with. WhisperFlow is very popular.
Kyutai always does very interesting work. Their delayed-streams work is bleeding edge and sounds very promising, especially for low latency. Not sure why I haven't tried it yet, tbh. https://github.com/kyutai-labs/delayed-streams-modeling
There's also a really nice, elegant, simple app: Handy. It only supports Whisper and Parakeet V3, but it's a nice app and those are amazing models. https://github.com/cjpais/Handy
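If you want a quick local test of Parakeet V3 outside of Handy, NeMo makes it a few lines. The model id here is from memory, so double-check it on Hugging Face:

    # Quick Parakeet V3 smoke test via NeMo (model id from memory -- verify).
    import nemo.collections.asr as nemo_asr

    asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
    out = asr.transcribe(["sample.wav"])  # expects 16 kHz mono wav
    # Newer NeMo returns Hypothesis objects (use out[0].text); older, plain str.
    print(out[0])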
You should look into the new Nvidia model: https://research.nvidia.com/labs/adlr/personaplex/
It has dual-channel input/output and a very permissive license.