pugio • 12/10/2025

There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)

Replies

artur44 • 12/10/2025

True, but even with native audio-token models you still need to split the model's output channels. Reasoning/internal tokens shouldn't go into the audio stream; only user-facing content should be emitted as audio. The principle is the same whether the last step is TTS or audio-token generation.
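
To make the channel split concrete, here is a rough sketch of the idea, not how Qwen-Omni or any other real-time model actually does it. It assumes each output token already carries a channel label; the `Token` type, the `"reasoning"` and `"user"` labels, and the `audio_stream` function are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Token:
    """Hypothetical output token; real models expose this differently."""
    channel: str  # assumed label: "reasoning" (internal) or "user" (user-facing)
    payload: str  # stand-in for a text piece or an audio-token id


def audio_stream(tokens: Iterable[Token], audible: str = "user") -> Iterator[str]:
    """Yield only the payloads on the user-facing channel.

    Whatever comes next (a TTS engine or an audio-token decoder)
    never sees the internal reasoning tokens.
    """
    for tok in tokens:
        if tok.channel == audible:
            yield tok.payload


# Example: interleaved reasoning and user-facing output.
demo = [
    Token("reasoning", "user asked for the time; check the clock"),
    Token("user", "It's"),
    Token("reasoning", "format as 12-hour"),
    Token("user", "3 pm."),
]
spoken = list(audio_stream(demo))
```

The filter sits before the final stage, so it works the same whether that stage is TTS or direct audio-token generation.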
