Hacker News

raybb 11/04/2025

"once the user stops talking" is a key insight here for me. When using this I wasn't intentionally pausing to let it figure out an answer. It seemed to just pop up while I was talking. But after experimenting some more, it does seem to wait until there's a bit of a pause most of the time.

However, it's still wild to me how fast and responsive it is. I can talk for 10 seconds and then in ~500ms I see the updates. Perhaps it doesn't even transcribe and instead feeds the audio to a multimodal LLM along with whatever tasks it already knows about? Or maybe it's transcribing live as you talk, and when you stop it sends the transcript to the LLM.

Anyone have a sense of what model they might be using?


Replies

makingstuffs 11/04/2025

I cannot remember the exact number off the top of my head, and am clearly too lazy to google it, but there is a specific length of time after which, if no new sound comes through, the human brain processes it as a pause/silence.

I want to say 300ms, which would be roughly in line with your ~500ms example.
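
For what it's worth, here is a rough sketch of how a pause-based cutoff like that could work. To be clear, this is a guess at the general technique (silence-based endpointing), not how the app in question actually does it. The frame length, the energy threshold, and the `utterances` function are all made up for illustration, and the 300ms figure is just the half-remembered number from above.

```python
from typing import Iterable, Iterator, List

FRAME_MS = 20            # assumed length of one audio frame
SILENCE_MS = 300         # the half-remembered pause length, not a verified figure
ENERGY_THRESHOLD = 0.01  # assumed cutoff between "speech" and "quiet"


def utterances(
    frame_energies: Iterable[float],
    frame_ms: int = FRAME_MS,
    silence_ms: int = SILENCE_MS,
    threshold: float = ENERGY_THRESHOLD,
) -> Iterator[List[float]]:
    """Yield one list of frame energies per utterance, split on long pauses."""
    needed = silence_ms // frame_ms  # consecutive quiet frames that count as a pause
    current: List[float] = []
    quiet = 0
    for energy in frame_energies:
        if energy >= threshold:
            current.append(energy)
            quiet = 0
        elif current:
            # Quiet frame inside an utterance: keep it in case speech resumes.
            current.append(energy)
            quiet += 1
            if quiet >= needed:
                yield current[:-quiet]  # drop the trailing silence
                current, quiet = [], 0
    if current:
        # Flush whatever was still in progress when the stream ended.
        yield current[:-quiet] if quiet else current
```

Each list it yields would be the point where you hand the audio (or a live transcript of it) to the model. A short gap between words stays inside one utterance, and only a gap of at least `silence_ms` ends it.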
