Hacker News

Show HN: I built a sub-500ms latency voice agent from scratch

292 points by nicktikhonov yesterday at 9:23 PM | 88 comments

I built a voice agent from scratch that averages ~400ms end-to-end latency (phone stop → first syllable). That’s with full STT → LLM → TTS in the loop, clean barge-ins, and no precomputed responses.

What moved the needle:

Voice is a turn-taking problem, not a transcription problem. VAD alone fails; you need semantic end-of-turn detection.

The system reduces to one loop: speaking vs listening. The two transitions - cancel instantly on barge-in, respond instantly on end-of-turn - define the experience.

STT → LLM → TTS must stream. Sequential pipelines are dead on arrival for natural conversation.

TTFT dominates everything. In voice, the first token is the critical path. Groq’s ~80ms TTFT was the single biggest win.

Geography matters more than prompts. Colocate everything or you lose before you start.
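The speaking/listening loop described above can be sketched as a tiny asyncio state machine. This is a stubbed illustration, not the author's implementation: `respond` stands in for the streamed LLM → TTS path, and the timings are simulated.

```python
import asyncio

class VoiceLoop:
    # Two states: listening (no speak_task) and speaking (speak_task running).
    def __init__(self):
        self.speak_task = None
        self.played = []                      # audio chunks emitted so far

    def on_user_audio(self):
        # Barge-in: user speech while we're speaking cancels playback instantly.
        if self.speak_task and not self.speak_task.done():
            self.speak_task.cancel()

    def on_end_of_turn(self, transcript):
        # Semantic end-of-turn fired: start responding immediately.
        self.speak_task = asyncio.ensure_future(self.respond(transcript))
        return self.speak_task

    async def respond(self, transcript):
        # Stand-in for streaming LLM -> TTS; one sentence per audio chunk.
        for sentence in ("Sure,", "let me check."):
            await asyncio.sleep(0.2)          # simulated per-chunk latency
            self.played.append(sentence)

async def demo():
    loop = VoiceLoop()
    task = loop.on_end_of_turn("what's the weather?")
    await asyncio.sleep(0.3)                  # first chunk plays...
    loop.on_user_audio()                      # ...then the user barges in
    try:
        await task
    except asyncio.CancelledError:
        pass
    return loop.played                        # only the first chunk survived
```

The two transitions are exactly the two entry points: `on_user_audio` is the instant cancel, `on_end_of_turn` is the instant respond.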

GitHub Repo: https://github.com/NickTikhonov/shuo

Follow whatever I next tinker with: https://x.com/nick_tikhonov


Comments

jedberg today at 1:04 AM

Oh, this is really interesting to me. This is what I worked on at Amazon Alexa (and have patents on).

An interesting fact I learned at the time: the median delay between human speakers during a conversation is 0ms (zero). In other words, in many cases the listener starts speaking before the speaker is done. You've probably experienced this; it's why people talk about "finishing each other's sentences".

It's because your brain is predicting what they will say while they speak, and composing an answer at the same time. It's also why, when they say something you didn't expect, you say "what?" and then answer half a second later, once your brain corrects.

Fact 2: Humans expect a delay from their voice assistants, for two reasons. First, they know it's a computer that has to think. Second, cell phones: cell phones have a built-in delay that disrupts human-to-human speech, and your brain treats a voice assistant like a cell phone.

Fact 3: Almost no response from Alexa is under 500ms. Even the ones that are served locally, like "what time is it".

Semantic end-of-turn is the key here. It's something we were working on years ago, but didn't have the compute power to do it. So at least back then, end-of-turn was just 300ms of silence.

This is pretty awesome. It's been a few years since I worked on Alexa (and everything I wrote has been talked about publicly). But I do wonder if they've made progress on semantic detection of end-of-turn.

Edit: Oh yeah, you are totally right about geography too. That was a huge unlock for Alexa. Getting the processing closer to the user.
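The silence-only baseline jedberg describes is trivial to sketch; thresholds here are illustrative, not Alexa's actual values:

```python
class SilenceEndpointer:
    # Naive end-of-turn: fire after 300 ms of continuous non-speech frames.
    def __init__(self, silence_ms=300, frame_ms=20):
        self.frames_needed = silence_ms // frame_ms   # 15 frames of 20 ms
        self.silent_frames = 0

    def push(self, frame_is_speech: bool) -> bool:
        # Feed one VAD decision per frame; returns True on end-of-turn.
        if frame_is_speech:
            self.silent_frames = 0
            return False
        self.silent_frames += 1
        return self.silent_frames >= self.frames_needed

ep = SilenceEndpointer()
frames = [True] * 10 + [False] * 15    # 200 ms of speech, then 300 ms silence
fired = [ep.push(f) for f in frames]   # fires on the 15th silent frame
```

The failure mode is obvious: a mid-sentence pause ("my flight number is... um") looks identical to a finished turn, which is exactly what semantic end-of-turn detection fixes.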

stonelazy today at 6:02 AM

This is a really solid writeup. The streaming pipeline architecture and the detailed per-stage latency breakdown are genuinely useful. Building the core turn-taking loop from scratch is such a good exercise, and you did an excellent job explaining why each part matters and where the actual bottlenecks live. Strongly recommend this to anyone who wants to understand what's really going on under the hood of a voice agent.

The one spot where it feels a bit off is the "2x faster than Vapi" claim. Your system is a clean straight pipe: transcript -> LLM -> TTS -> audio. No tool calls, no function execution, no webhooks, no mid-turn branching.

Production platforms like Vapi are doing way more work on every single turn. The LLM might decide to call a tool—search a knowledge base, hit an API, check a calendar—which means pausing token streaming, executing the tool, injecting the result back into context, re-prompting the LLM, and only then resuming the stream to TTS. That loop can happen multiple times in a single turn. Then layer on call recording, webhook delivery, transcript logging, multi-tenant routing, and all the reliability machinery you need for thousands of concurrent calls… and you’re comparing two pretty different workloads.

The core value of the post is that deep dive into the orchestration loop you built yourself. If it had just been "here’s what I learned rolling my own from scratch," it would’ve been an unqualified win. The 2x comparison just needs a quick footnote acknowledging that the two systems aren’t actually doing the same amount of work per turn.

brody_hamer today at 12:23 AM

> Voice is a turn-taking problem

It really feels to me like there's some low-hanging fruit with voice that no one is capitalizing on: filler words and pacing. When the LLM notices a silence, it fills it with a contextually aware filler word while the real response generates. Just an "mhmm" or a "right, right". It would go a long way toward making the back and forth feel more like a conversation, and if the speaker wasn't done speaking, there's no talking-over-the-user garbage. (Say the filler word, then continue listening.)

cootsnuck today at 5:55 AM

Yea, Deepgram Flux is the secret sauce. Doesn't get talked about much.

For anyone curious: https://flux.deepgram.com/

armcat yesterday at 10:56 PM

This is an outstanding write-up, thank you! Regarding LLM latency, OpenAI recently introduced WebSocket support in their Responses client, so it should be a bit faster. An alternative is to run a very small LLM locally on your device. I built my own fully local pipeline and it was sub-second RTT, with no streaming or optimisations: https://github.com/acatovic/ova

modeless yesterday at 10:49 PM

IMO STT -> LLM -> TTS is a dead end. The future is end-to-end. I played with this two years ago and even made a demo you can install locally on a gaming GPU: https://github.com/jdarpinian/chirpy, but concluded that making something worth using for real tasks would require training of end-to-end models. A really interesting problem I would love to tackle, but out of my budget for a side project.

lukax yesterday at 10:25 PM

Or you could use Soniox Real-time (supports 60 languages) which natively supports endpoint detection - the model is trained to figure out when a user's turn ended. This always works better than VAD.

https://soniox.com/docs/stt/rt/endpoint-detection

Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.

https://www.daily.co/blog/benchmarking-stt-for-voice-agents/

You can try a demo on the home page:

https://soniox.com/

Disclaimer: I used to work for Soniox

Edit: I commented too soon. I only saw VAD and immediately thought of Soniox which was the first service to implement real time endpoint detection last year.

NickNaraghi yesterday at 9:34 PM

Pretty exciting breakthrough. This actually mirrors the early days of game engine netcode evolution. Since latency is an orchestration problem (not a model problem) you can beat general-purpose frameworks by co-locating and pipelining aggressively.

Carmack's 2013 "Latency Mitigation Strategies" paper[0] made the same point for VR too: every millisecond hides in a different stage of the pipeline, and you only find them by tracing the full path yourself. Great find with the warm TTS websocket pool saving ~300ms, perfect example of this.

[0]: https://danluu.com/latency-mitigation/
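The warm-pool trick is worth spelling out, since it generalizes to any per-call connection cost. A minimal sketch with a hypothetical async `connect` factory (the ~300ms figure is the post's, covering TCP/TLS/websocket setup):

```python
import asyncio

class WarmPool:
    # Keep `size` connections pre-established so the hot path never
    # pays connect + handshake latency; refills happen off the hot path.
    def __init__(self, connect, size=4):
        self._connect = connect               # async factory for one connection
        self._pool = asyncio.Queue()
        self._size = size

    async def start(self):
        for _ in range(self._size):
            await self._pool.put(await self._connect())

    async def acquire(self):
        conn = await self._pool.get()         # already connected: ~0 ms
        asyncio.ensure_future(self._refill()) # replace it in the background
        return conn

    async def _refill(self):
        await self._pool.put(await self._connect())
```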

age123456gpg yesterday at 10:51 PM

Hi all! Check out this Handy app https://github.com/cjpais/Handy - a free, open source, and extensible speech-to-text application that works completely offline.

I am using it daily to drive Claude and it works really well for me (much better than macOS dictation mode).

aanet today at 5:30 AM

The voice samples sound fantastic. The interruption handling is amazing. I felt like I was talking to an actual person. It might have helped that he had a British accent :)

CharlesLau today at 5:37 AM

I was surprised to notice that the GitHub repository's name is actually the Mandarin character 说 (speak).

kaonwarb today at 4:02 AM

One of the challenges with trying to achieve IRL human-level latency is that we rely on nonverbal cues for face-to-face turn-taking. See e.g. https://www.sciencedirect.com/science/article/pii/S001002772...

docheinestages yesterday at 10:58 PM

Does anyone know about a fully offline, open-source project like this voice agent (i.e. STT -> LLM -> TTS)?

suganesh95 today at 2:53 AM

This is great. I built three assistants last week for the same purpose, with entirely different tech stacks.

(Raspberry Pi Voice Assistant)

Jarvis uses Porcupine for wake word detection with the built-in "jarvis" keyword. Speech input flows through ElevenLabs Scribe v2 for transcription. The LLM layer uses Groq llama-3.3-70b-versatile as primary with Groq llama-3.1-8b-instant as fallback. Text-to-speech uses Smallest.ai Lightning with Chetan voice. Audio input/output handled by ALSA (arecord/aplay). End-to-end latency is 3.8–7.3 seconds.

(Twilio + VPS)

This setup ingests audio via Twilio Media Streams in μ-law 8kHz format. Silero VAD detects speech for turn boundaries. Groq Whisper handles batch transcription. The LLM stack chains Groq llama-4-scout-17b (primary), Groq llama-3.3-70b-versatile (fallback 1), and Groq llama-3.1-8b-instant (fallback 2) with automatic failover. Text-to-speech uses Smallest.ai Lightning with Pooja voice. Audio is encoded from PCM to μ-law 8kHz before streaming back via Twilio. End-to-end latency is 0.5–1.1 seconds.
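The PCM → μ-law step mentioned above can be done in pure Python. This is a sketch of standard G.711 μ-law encoding, one sample at a time (the stdlib audioop module covered this before its removal in Python 3.13):

```python
def pcm16_to_ulaw(sample: int) -> int:
    # Encode one signed 16-bit PCM sample as one G.711 mu-law byte.
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0x00
    if sample < 0:
        sample = -sample
    sample = min(sample, CLIP) + BIAS
    exponent = 7                              # position of highest set bit
    while exponent > 0 and not sample & (1 << (exponent + 7)):
        exponent -= 1
    mantissa = (sample >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # mu-law is bit-inverted

# Silence encodes to 0xFF/0x7F, full scale to 0x80/0x00:
encoded = bytes(pcm16_to_ulaw(s) for s in (0, -1, 32767, -32768))
```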

───

(Alexa Skill)

Tina receives voice input through Alexa's built-in ASR, followed by Alexa's NLU for intent detection. The LLM is Claude Haiku routed through the OpenClaw gateway. Voice output uses Alexa's native text-to-speech. End-to-end latency is 1.5–2.5 seconds.

eru today at 4:57 AM

> [...] and no precomputed responses.

You could probably improve your metrics even more with those in the mix again?

ggm today at 3:40 AM

That's half a second of delay, 0.4 to 0.5 seconds. That's the same as the delay in a GEO-satellite-mediated phone conversation.

Perhaps I'm in an older cohort, but I remember this delay, and what it felt like sustaining a conversation with this class of delay.

(it's still a remarkable advance, but do bear in mind the UX)

hosaka today at 3:38 AM

Depending on the TTS model being used, latency can be reduced further with an LRU cache, fetching common phrases from the cache instead of generating them fresh with TTS.

However, the naturalness of the result will depend on how the TTS model works and whether two identical chunks of text sound alike on every generation.
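A minimal version of that cache, assuming a synchronous `synthesize` callable (hypothetical name) that returns audio bytes:

```python
from collections import OrderedDict

class TTSCache:
    # LRU cache of synthesized audio, keyed by exact phrase text.
    def __init__(self, synthesize, max_items=256):
        self._synthesize = synthesize
        self._cache = OrderedDict()
        self._max = max_items

    def get(self, phrase: str) -> bytes:
        if phrase in self._cache:
            self._cache.move_to_end(phrase)   # mark as recently used
            return self._cache[phrase]
        audio = self._synthesize(phrase)      # cache miss: pay full TTS latency
        self._cache[phrase] = audio
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)   # evict least recently used
        return audio
```

As the comment notes, this only helps if the model renders a cached phrase indistinguishably from a freshly generated one; with non-deterministic TTS the repeated audio can stand out.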

suganesh95 today at 2:44 AM

I built something very similar and comparable to this, with wake-word detection, on my Raspberry Pi.

Groq 8b instant is the fastest LLM from my tests. I used Smallest.ai for TTS as it has the smallest TTFT.

My Raspberry Pi stack: Porcupine for wake-word detection + ElevenLabs for STT + Groq Scout (as it supports home automation better) + Smallest.ai for its 70ms TTFB.

Call stack: Twilio + Groq Whisper for STT + Groq 8b instant + Smallest.ai for TTS.

Alexa skill stack: wrote an Alexa skill to contact my stack running on a VPS.

loevborg yesterday at 10:35 PM

Nice write-up, thanks for sharing. How does your hand-vibed Python program compare to frameworks like Pipecat or LiveKit Agents? Both are also written in Python.

perelin yesterday at 10:40 PM

Great writeup! For VAD did you use a headphone/mic combo, or an open mic? If open, how did you deal with the agent interrupting itself?

MbBrainz yesterday at 9:31 PM

Love it! Solving the latency problem is essential to making voice AI usable and comfortable. Your point on VAD is interesting; I hadn't thought about that.

kelvinjps10 today at 3:04 AM

The quality of the post was amazing. I'm not that interested in voice agents yet, but I was still engaged through the whole post. And the little animation made the loop easier to understand.

nmstoker today at 12:22 AM

This was discussed 21 days ago:

https://news.ycombinator.com/item?id=46946705

boznz yesterday at 10:45 PM

"Voice is an orchestration problem" is basically correct. The two takeaways for me are:

1. I wonder if it could be optimised more by supporting just a single language, and

2. How do we get around the problem of interference? Humans are good at conversation discrimination, i.e. listening while multiple conversations, TV, music, etc. are going on in the background. I've not had much success with voice in noisy environments.

waynerisner today at 2:36 AM

I am really curious about this for enunciation, articulation, and accessibility applications.

bronco21016 today at 2:28 AM

When someone is able to put something like this together on their own it leaves me feeling infuriated that we can’t have nice things on consumer hardware.

At a minimum, Siri, Alexa, and Google Home should have a path to plug in a tool like this. Instead I'm hacking together conversation loops in iOS Shortcuts to approximate this style of interaction, with significantly worse UX.

grayhatter today at 12:35 AM

You made, or you asked an LLM to generate?

mst98 today at 4:16 AM

This is so cool

shubh-chat yesterday at 11:53 PM

This is superb, Nick! Thanks for this. Will try it out at some point for a project I'm building.

jangletown yesterday at 9:52 PM

impressive

foxes today at 1:51 AM

<think> I need to generate a Show HN: style comment to maximise engagement as the next step. Let's break this down:

First I'll describe the performance metrics and the architecture.

Next I'll elaborate on the streaming aspect and the geographical limitations important to the performance.

Finally the user asked me to make sure to keep the tone appropriate to Hacker News and to link their github – I'll make sure to include the link. </think>