Hacker News

sburud 11/04/2025 2 replies

That’s cool! Slight fear of replicating the Dropbox comment here, but all you really need to do is run Whisper (or some other speech-to-text model), then, once the user stops talking, jam the transcript through an LLM to force it into JSON or some other sensible structure.
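
A minimal sketch of that pipeline, assuming the speech-to-text step and the LLM call are passed in as plain functions. `transcribe`, `complete`, the `Task` fields, and the prompt are all hypothetical stand-ins, not what the app actually uses.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical target structure; the real app's schema is unknown.
@dataclass
class Task:
    title: str
    done: bool = False

# Hypothetical prompt; any instruction that pins down the JSON shape would do.
PROMPT = (
    "Convert the transcript into a JSON array of objects with keys "
    '"title" (string) and "done" (boolean). Reply with JSON only.\n\n'
    "Transcript: "
)

def parse_tasks(raw: str) -> List[Task]:
    """Validate the model's reply instead of trusting it blindly."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    tasks = []
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("title"), str):
            raise ValueError(f"bad task entry: {item!r}")
        tasks.append(Task(title=item["title"], done=bool(item.get("done", False))))
    return tasks

def speech_to_tasks(
    audio_path: str,
    transcribe: Callable[[str], str],  # assumed: audio path -> transcript text
    complete: Callable[[str], str],    # assumed: prompt -> raw LLM reply
) -> List[Task]:
    """Run speech-to-text, then force the transcript into typed tasks."""
    transcript = transcribe(audio_path)
    return parse_tasks(complete(PROMPT + transcript))
```

Keeping the two model calls as injected functions means the Whisper step or the LLM can be swapped without touching the parsing, and a malformed reply fails loudly instead of slipping through as a half-filled structure.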


Replies

raybb 11/04/2025

"once the user stops talking" is a key insight here for me. When using this, I wasn't intentionally pausing to let it figure out an answer; it seemed to just pop up while I was talking. But upon experimenting some more, it does seem to wait until there's a bit of a pause most of the time.

However, it's still wild to me how fast and responsive it is. I can talk for 10 seconds and then see the updates in ~500ms. Perhaps it doesn't even transcribe, and instead feeds the audio to a multimodal LLM along with whatever tasks it already knows about? Or maybe it's transcribing live as you talk, and when you stop it sends the transcript to the LLM.
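
The second guess (transcribe live, hand off on a pause) could look roughly like this. It is only a sketch: the energy threshold and pause length are made-up values, and `on_utterance` is a hypothetical callback, not anything known about the product. If the transcript is already built up while you talk, only the LLM call is left once the pause is detected, which would fit the short delay.

```python
from typing import Callable, Iterable, List, Sequence

def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5

def split_on_pauses(
    frames: Iterable[Sequence[float]],
    on_utterance: Callable[[List[Sequence[float]]], None],  # hypothetical hand-off
    threshold: float = 0.02,  # assumed energy level that counts as speech
    pause_frames: int = 15,   # assumed; about 300 ms if frames are 20 ms each
) -> None:
    """Call on_utterance once per stretch of speech that ends in a long pause."""
    buffered: List[Sequence[float]] = []
    silent_run = 0
    heard_speech = False
    for frame in frames:
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
            buffered.append(frame)
        elif heard_speech:
            # Keep short gaps: they may just be breaths between words.
            silent_run += 1
            buffered.append(frame)
            if silent_run >= pause_frames:
                # Long enough pause: emit the speech without the trailing silence.
                on_utterance(buffered[:-silent_run])
                buffered, silent_run, heard_speech = [], 0, False
    if heard_speech:
        # Stream ended mid-utterance: flush what is left, minus trailing silence.
        on_utterance(buffered[: len(buffered) - silent_run])
```

A real system would more likely use a trained voice-activity detector than a raw energy threshold, but the control flow (buffer while speaking, flush on a pause) would be the same.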

Anyone have a sense of what model they might be using?

SteveMorin 11/04/2025

https://boundaryml.com/

LLM to types and done