The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user.
A simple way is to split the model's output stream before TTS: reasoning/structured tokens go into one bucket, user-facing text into another, and only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
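A rough sketch of what that split can look like, assuming the model marks hidden content with explicit tags (here `<think>` / `<tool_call>`, each arriving as its own chunk) and with a placeholder in place of a real streaming TTS call:

```python
# Minimal sketch of splitting the output stream before TTS.
# Assumptions: the model wraps hidden content in <think>...</think> or
# <tool_call>...</tool_call> tags, and each tag arrives as its own chunk.
# Marker names and synthesize_speech() are placeholders; swap in whatever
# your model and TTS stack actually emit.
import re
from typing import Iterable, Iterator

HIDDEN_OPEN = re.compile(r"<(think|tool_call)>")
HIDDEN_CLOSE = re.compile(r"</(think|tool_call)>")

def user_facing_text(chunks: Iterable[str]) -> Iterator[str]:
    """Yield only chunks meant to be heard; drop reasoning/structured spans."""
    depth = 0  # >0 while inside a hidden block (handles nesting)
    for chunk in chunks:
        stripped = chunk.strip()
        if HIDDEN_OPEN.fullmatch(stripped):
            depth += 1
        elif HIDDEN_CLOSE.fullmatch(stripped):
            depth = max(0, depth - 1)
        elif depth == 0:
            yield chunk  # user-facing bucket: forward to TTS

def synthesize_speech(text: str) -> None:
    """Stand-in for a real streaming TTS call."""
    print(f"[TTS] {text}", flush=True)

if __name__ == "__main__":
    stream = ["<think>", "user wants the weather", "</think>",
              "It's ", "sunny ", "and 22°C."]
    for chunk in user_facing_text(stream):
        synthesize_speech(chunk)
```

The same filter works whether the hidden spans come from reasoning tokens or from structured/tool output; the only thing that changes is the set of markers you treat as non-diegetic.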