I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).
Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.
What are people actually using in 2026 if they want open + local voice?
Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still "streaming ASR + LLM + streaming TTS" glued together? (Rough sketch of that cascade at the bottom of this post.)
If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?
What’s the most “works today” combo on a single GPU?
Bonus: rough numbers people actually see for mic → first-audio-back latency.
Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
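To make "glued together" concrete, here's the shape I mean, as a minimal non-streaming sketch in Python. Everything here is placeholder-level: faster-whisper for ASR, an OpenAI-compatible server assumed on localhost:8000 (llama.cpp / Ollama / vLLM all expose one), and the Piper CLI for TTS with a made-up voice path. It records fixed 5-second windows, so it's nowhere near the barge-in loop I actually want:

    # Minimal cascaded loop: mic -> Whisper -> local LLM -> Piper -> speaker.
    # Blocking and non-streaming; it only shows the shape of the pipeline.
    import subprocess
    import tempfile

    import requests
    import sounddevice as sd
    import soundfile as sf
    from faster_whisper import WhisperModel

    SAMPLE_RATE = 16000
    asr = WhisperModel("small.en", device="cuda", compute_type="float16")

    def record(seconds=5):
        # Record a fixed window from the default mic (no VAD/endpointing).
        audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                       channels=1, dtype="float32")
        sd.wait()
        return audio.squeeze()

    def transcribe(audio):
        segments, _ = asr.transcribe(audio, language="en")
        return " ".join(s.text for s in segments).strip()

    def llm_reply(prompt):
        # Assumes any OpenAI-compatible server on :8000; most local servers
        # ignore or remap the model name.
        r = requests.post("http://localhost:8000/v1/chat/completions", json={
            "model": "local",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        })
        return r.json()["choices"][0]["message"]["content"]

    def speak(text):
        # Shell out to the Piper CLI; the voice/model path is a placeholder.
        with tempfile.NamedTemporaryFile(suffix=".wav") as f:
            subprocess.run(["piper", "--model", "en_US-lessac-medium.onnx",
                            "--output_file", f.name],
                           input=text.encode(), check=True)
            audio, sr = sf.read(f.name)
            sd.play(audio, sr)
            sd.wait()

    while True:
        heard = transcribe(record())
        if heard:
            speak(llm_reply(heard))

Swapping each stage for a streaming equivalent (and adding VAD for endpointing/barge-in) is exactly the part I can't find a good writeup for.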
Tangential: what hardware are you using for the interface on these? Is there a good array microphone that performs on par with Echos/Google Homes/HomePods?
For the TTS part: https://github.com/supertone-inc/supertonic
I have used https://github.com/SaynaAI/sayna . What I like most is that you can switch between providers easily and see what works best for you. It also supports local models.
I haven't tried them myself, but Kyutai has a couple of projects that could fit.
It requires a bit of tinkering, but I think pipecat is the way to go. You can plug in pretty much any STT/LLM/TTS you want and go. It definitely supports local models, but it's up to you to get your hands on those models.
Not sure if there are any turnkey setups preconfigured for local install where you can just press play, though.
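To give you an idea of the shape, a local-only pipeline looks roughly like this. Big caveat: the import paths and service names are from my memory of an older pipecat release, so treat all of them as assumptions and check the current docs:

    # Rough shape only -- import paths/service names from an older pipecat
    # release; verify against current docs before trusting any of this.
    import asyncio

    from pipecat.pipeline.pipeline import Pipeline
    from pipecat.pipeline.runner import PipelineRunner
    from pipecat.pipeline.task import PipelineTask
    from pipecat.services.ollama import OLLamaLLMService
    from pipecat.services.whisper import WhisperSTTService
    from pipecat.transports.base_transport import TransportParams
    from pipecat.transports.local.audio import LocalAudioTransport

    async def main():
        transport = LocalAudioTransport(
            TransportParams(audio_in_enabled=True, audio_out_enabled=True))
        stt = WhisperSTTService()                    # local faster-whisper
        llm = OLLamaLLMService(model="llama3.1:8b")  # any Ollama model
        # Real setups also put a context aggregator between stt and llm,
        # and a locally-running TTS service between llm and audio out.
        pipeline = Pipeline([
            transport.input(),   # mic frames in
            stt,                 # audio -> text frames
            llm,                 # text -> streamed completion frames
            # tts,               # text -> audio frames (your local pick)
            transport.output(),  # speaker out
        ])
        await PipelineRunner().run(PipelineTask(pipeline))

    asyncio.run(main())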
Last I heard, E2E speech-to-speech models are still pretty weak. I've had pretty bad results from gpt-realtime, and that's a proprietary model; I'm assuming open source is a bit behind.
https://handy.computer got good marks from a very nontechnical user in my life this week!
Local, FOSS
Anyone using any reasonably good small open-source speech-to-text models?
It was a little annoying getting the old Qt5 tools installed, but I really enjoyed using dsnote / Speech Note. Huge model selection for my AMD GPU. Good tool. I haven't studied the options closely enough yet to suggest which model to go with. WhisperFlow is very popular.
Kyutai always does very interesting work. Their delayed-streams work is bleeding edge and sounds very promising, especially for low latency. Not sure why I haven't tried it yet, tbh. https://github.com/kyutai-labs/delayed-streams-modeling
There's also a really nice, elegant, simple app: Handy. It only supports Whisper and Parakeet V3, but it's a nice app and those are amazing models. https://github.com/cjpais/Handy
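If you want a quick local test of Parakeet V3 outside of Handy, NeMo makes it a few lines. The model id here is from memory, so double-check it on Hugging Face:

    # Quick Parakeet V3 smoke test via NeMo (model id from memory -- verify).
    import nemo.collections.asr as nemo_asr

    asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")
    out = asr.transcribe(["sample.wav"])  # expects 16 kHz mono wav
    # Newer NeMo returns Hypothesis objects (use out[0].text); older, plain str.
    print(out[0])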
You should look into the new Nvidia model: https://research.nvidia.com/labs/adlr/personaplex/
It has dual-channel input/output and a very permissive license.