
sosodev, last Wednesday at 4:55 PM (4 replies)

Does Qwen3-Omni support real-time conversation like GPT-4o? Looking at their documentation it doesn't seem like it does.

Are there any open-weight models that do? To be clear, I'm not talking about a speech-to-text -> LLM -> text-to-speech pipeline; I mean a real voice <-> language model.
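For readers unfamiliar with the distinction: a minimal sketch of the cascaded pipeline being ruled out here, with all three components as stand-in stubs (none of these classes are real APIs). The point is that audio gets flattened to text in the middle, which is exactly what a native voice <-> language model avoids.

```python
# Hypothetical sketch of a cascaded STT -> LLM -> TTS pipeline.
# Every class here is a stub standing in for a real model.

class SpeechToText:
    def transcribe(self, audio: bytes) -> str:
        return "hello"  # stub: a real ASR model would go here

class TextLLM:
    def reply(self, text: str) -> str:
        return f"you said: {text}"  # stub: a real language model would go here

class TextToSpeech:
    def synthesize(self, text: str) -> bytes:
        return text.encode()  # stub: a real TTS model would go here

def cascaded_turn(audio_in: bytes) -> bytes:
    """One conversational turn through the cascade.

    Everything non-textual (pitch, tone, pronunciation, overlapping
    speech) is discarded at the text bottleneck, which is why a native
    audio <-> audio model behaves differently from this pipeline.
    """
    text = SpeechToText().transcribe(audio_in)
    reply = TextLLM().reply(text)
    return TextToSpeech().synthesize(reply)
```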

edit:

It does support real-time conversation! Has anybody here gotten that to work on local hardware? I'm particularly curious if anybody has run it with a non-nvidia setup.


Replies

potatoman22, last Wednesday at 10:34 PM

From what I can tell, their official chat site doesn't have a native audio -> audio model yet. I like to test this through homophones (e.g. record and record) and asking it to change its pitch or produce sounds.

red2awn, last Wednesday at 7:38 PM

None of the inference frameworks (vLLM/SGLang) support the full model, let alone on non-Nvidia hardware.

dsrtslnd23, last Wednesday at 5:01 PM

It seems to be able to do native speech-to-speech.

ivape, last Wednesday at 8:16 PM

That's exciting. I doubt there are any polished local voice-chat apps yet that you can easily plug this into (I doubt the user experience is "there" yet). Even stuff like Silly Tavern is near unusable; there's lots of work to be done on the local front. Local voice models are what's going to enable that whole Minority Report workflow soon enough, especially if commands and intent are determined locally and the meat of the prompt is handled by a larger remote model.
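The local/remote split described above can be sketched as a simple router. This is purely illustrative; the intent list, function names, and the string return in place of a real network call are all assumptions, not any particular app's design.

```python
# Hypothetical sketch of local intent routing: simple commands are
# resolved on-device, and anything else is forwarded to a larger
# remote model. Names and intents here are illustrative only.

LOCAL_INTENTS = {"pause", "resume", "volume up", "volume down"}

def route(utterance: str) -> str:
    """Handle recognized commands locally; defer everything else."""
    command = utterance.strip().lower()
    if command in LOCAL_INTENTS:
        # Cheap, low-latency path: no network round trip needed.
        return f"local: executed '{command}'"
    # In a real system this would be a network call to a hosted model;
    # here it is stubbed as a string for the sake of the sketch.
    return f"remote: forwarded '{utterance}'"
```

The appeal of this split is latency: the commands users issue constantly get a fast on-device path, while only open-ended requests pay the cost of a remote round trip.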

I think this is the new field of programming. There will be tons of work for those who can build these new workflows, which will need to be primarily natural-language driven.
