> This is hard. Basic solutions use time after silence to ‘determine’ when someone has stopped talking. But that adds latency. If you tune it too short, the AI agent will talk over you. Too long, and it’ll take a while to respond. The model has to be dedicated to accurately detecting end-of-turn based on conversation signals, and to speculating on inputs to get a head start.
I spent time solving this exact problem at my last job. The best I got was a signal that the conversation had ended with ~200ms of latency, via a very ugly hack.
I'm genuinely curious how others have solved this problem!
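For reference, the naive approach the quote describes boils down to something like this (a toy sketch, not the ugly hack I mentioned; the 0.7s timeout is made up, and tuning it is exactly the latency-vs-interruption tradeoff above):

```python
# Naive end-of-turn detection: declare the turn over after N ms of non-speech.
# Too short a timeout and the agent talks over the user; too long and every
# response feels laggy.
import time

class NaiveEndpointer:
    def __init__(self, timeout_s: float = 0.7):  # 0.7s is illustrative
        self.timeout_s = timeout_s
        self.last_speech_time: float | None = None

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one frame's VAD decision; returns True when the turn looks over."""
        now = time.monotonic()
        if is_speech:
            self.last_speech_time = now
            return False
        if self.last_speech_time is None:
            return False  # no speech heard yet, nothing to end
        # End-of-turn == enough continuous non-speech after some speech.
        return (now - self.last_speech_time) >= self.timeout_s
```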
There's a really nice implementation of phrase endpointing here:
It uses three signals as input: silence interval, speech confidence, and audio level.

Silence isn't literally silence -- or shouldn't be. Silence here is "non-speech" time: any "voice activity detection" library can be plugged into this code to report it. Most people use Silero VAD.
Speech confidence can likewise come from the VAD, or from another model (like a model providing transcription, or an LLM doing native audio input).
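As a concrete sketch of those first two signals (not the linked code -- just a minimal example assuming Silero VAD's torch.hub interface, which returns a per-chunk speech probability):

```python
# Minimal sketch: per-chunk speech confidence from Silero VAD, plus "silence"
# tracked as accumulated non-speech time rather than literal low audio level.
import torch

model, _utils = torch.hub.load("snakers4/silero-vad", "silero_vad")

SAMPLE_RATE = 16000
CHUNK_SAMPLES = 512       # Silero VAD expects 512-sample chunks at 16 kHz
SPEECH_THRESHOLD = 0.5    # illustrative

class SilenceTracker:
    def __init__(self):
        self.non_speech_ms = 0.0

    def process(self, chunk: torch.Tensor) -> tuple[float, float]:
        """chunk: float32 mono tensor of CHUNK_SAMPLES samples in [-1, 1].
        Returns (speech confidence, accumulated non-speech time in ms)."""
        confidence = model(chunk, SAMPLE_RATE).item()
        if confidence < SPEECH_THRESHOLD:
            self.non_speech_ms += 1000.0 * CHUNK_SAMPLES / SAMPLE_RATE
        else:
            self.non_speech_ms = 0.0
        return confidence, self.non_speech_ms
```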
Audio level should be measured relative to background noise, as it is in this code. The VAD model should already be pretty good at factoring out non-speech background noise, so the utility here is mostly speaker isolation: you want to trigger on speech end from the loudest of the simultaneous voices. (There are, of course, specialized models just for speaker isolation. The commercial ones from Krisp are quite good.)
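To make the combination concrete, here's a rough sketch (again, not the referenced implementation; the thresholds and the noise-floor smoothing constant are invented) that gates end-of-phrase on all three signals, with audio level expressed in dB over a slowly-adapting background estimate:

```python
# Rough sketch: combine non-speech interval, speech confidence, and audio
# level relative to an adaptive background-noise estimate.
import math

class PhraseEndpointer:
    def __init__(self, silence_ms=500.0, max_confidence=0.4, level_margin_db=6.0):
        self.silence_ms = silence_ms            # required non-speech interval
        self.max_confidence = max_confidence    # confidence must fall below this
        self.level_margin_db = level_margin_db  # level must be close to background
        self.background_rms = 1e-4              # adaptive noise-floor estimate

    def update(self, non_speech_ms: float, speech_confidence: float, frame_rms: float) -> bool:
        """Returns True when the current phrase looks finished."""
        # Only adapt the background estimate during (probable) non-speech frames,
        # so the noise floor doesn't learn the speaker's voice.
        if speech_confidence < self.max_confidence:
            self.background_rms = 0.95 * self.background_rms + 0.05 * max(frame_rms, 1e-9)
        level_db = 20.0 * math.log10(max(frame_rms, 1e-9) / self.background_rms)
        return (
            non_speech_ms >= self.silence_ms
            and speech_confidence < self.max_confidence
            and level_db < self.level_margin_db
        )
```

Feeding it the confidence and non-speech time from the VAD sketch above, plus each chunk's RMS, is enough to experiment with how aggressive those thresholds can be before the agent starts talking over people.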
One interesting thing about processing audio for AI phrase endpointing is that you don't actually care about human legibility. So you don't need traditional background noise reduction, in theory. In practice, though, given how current transcription and speech models are trained, there's a lot of overlap with audio that has been recorded for humans to listen to!