artur44 • 12/10/2025 • 1 reply

A simple approach is to split the model’s output stream before TTS: reasoning/structured tokens go into one bucket, user-facing text into another, and only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
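
A rough sketch of that split, assuming a text-then-TTS pipeline in which reasoning is wrapped in delimiter tokens. The `<think>`/`</think>` markers and the `split_stream` helper are hypothetical names for illustration, not anything a particular model guarantees:

```python
from typing import Iterable, List, Tuple

# Assumed delimiters; a real model may use different markers, or none.
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_stream(tokens: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Route tokens into (reasoning, spoken) buckets.

    Assumes each delimiter arrives as its own token, never split across chunks.
    """
    reasoning: List[str] = []
    spoken: List[str] = []
    in_reasoning = False
    for tok in tokens:
        if tok == OPEN_TAG:
            in_reasoning = True       # start diverting tokens away from audio
        elif tok == CLOSE_TAG:
            in_reasoning = False      # back to user-facing text
        elif in_reasoning:
            reasoning.append(tok)
        else:
            spoken.append(tok)
    return reasoning, spoken


stream = ["<think>", "user", " wants", " weather", "</think>",
          "It", " is", " sunny", "."]
reasoning, spoken = split_stream(stream)
tts_input = "".join(spoken)  # only this string would be handed to a TTS engine
```

The reasoning bucket can still be logged or shown in a debug view; it just never reaches the synthesizer.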

Replies

pugio • 12/10/2025

There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)
