> Non-verbal cues are invisible to text: Transcription-based models discard sighs, throat-clearin...

nextaccountic • 01/15/2026 • 1 reply • view on HN

> Non-verbal cues are invisible to text: Transcription-based models discard sighs, throat-clearing, hesitation sounds, and other non-verbal vocalizations that carry critical conversational-flow information. Sparrow-1 hears what ASR ignores.

Could Sparrow instead be used to produce high quality transcription that incorporate non-verbal cues?

Or even, use Sparrow AND another existing transcription/ASR thing to augment the transcription with non-verbal cues

Replies

bpanahij • 01/15/2026

This is a very good idea. We currently have a model in our perception system (Raven-1) that performs this partially. It uses audio to understand tone and augment the transcription we send to the conversational LLM. That seems to have an impact on the conversational style of the replicas output, in a good way. We’re still evaluating that model and will post updates when we have better insights.

alt Hacker News

Replies