Hacker News

sim04ful · last Wednesday at 5:51 PM

The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from the outputs meant to be heard by the end user.

I'm curious how anyone has solved this.


Replies

artur44 · last Wednesday at 7:19 PM

A simple way is to split the model's output stream before TTS. Reasoning/structured tokens go into one bucket, actual user-facing text into another, and only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
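A minimal sketch of that split, assuming the model marks reasoning with `<think>...</think>` tags (the actual delimiters vary by model and provider, and structured output would need an analogous filter, e.g. on JSON fences). It buffers across chunk boundaries so a tag split over two chunks is still caught, and only text outside the tags reaches TTS:

```python
from typing import Iterator

# Assumed reasoning delimiters; adjust to whatever your model emits.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_stream(tokens: Iterator[str]) -> Iterator[str]:
    """Yield only user-facing text from a streamed model response,
    dropping anything between THINK_OPEN and THINK_CLOSE."""
    buf = ""
    speaking = True  # False while inside a non-diegetic span
    for chunk in tokens:
        buf += chunk
        while True:
            if speaking:
                i = buf.find(THINK_OPEN)
                if i == -1:
                    # Hold back a tail in case an open tag is split
                    # across chunk boundaries; flush the rest.
                    safe = max(0, len(buf) - len(THINK_OPEN) + 1)
                    if safe:
                        yield buf[:safe]
                        buf = buf[safe:]
                    break
                yield buf[:i]
                buf = buf[i + len(THINK_OPEN):]
                speaking = False
            else:
                i = buf.find(THINK_CLOSE)
                if i == -1:
                    # Discard hidden text, keeping only a possible
                    # partial close tag for the next chunk.
                    buf = buf[-(len(THINK_CLOSE) - 1):]
                    break
                buf = buf[i + len(THINK_CLOSE):]
                speaking = True
    if speaking and buf:
        yield buf  # flush whatever remains at end of stream

# Usage: feed only the speakable bucket to the synthesizer.
# for text in split_stream(model_stream):
#     tts.synthesize(text)  # hypothetical TTS call
```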
