Hacker News

kwindla · 10/02/2024 · 1 reply

There's a really nice implementation of phrase endpointing here:

  https://github.com/pipecat-ai/pipecat/blob/d378e699d23029e8ca7cea7fb675577becd5ebfb/src/pipecat/vad/vad_analyzer.py
It uses three signals as input: silence interval, speech confidence, and audio level.

Silence isn't literally silence -- or shouldn't be. "Silence" here means non-speech time: any "voice activity detection" (VAD) library can be plugged into this code to detect it, and most people use Silero VAD.
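
For example, Silero's streaming model gives you a per-chunk speech probability, and "silence" is just the accumulated time spent below a threshold. A rough sketch (not the pipecat code; the 0.5 threshold is an assumption):

  import torch

  # Silero VAD streaming model: returns a per-chunk speech probability.
  model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad')

  SAMPLE_RATE = 16000
  CHUNK = 512               # recent Silero versions expect 512-sample chunks at 16 kHz
  SPEECH_THRESHOLD = 0.5    # assumed threshold; tune for your audio

  class SilenceTracker:
      """Accumulates non-speech ("silence") time from VAD probabilities."""
      def __init__(self):
          self.silence_ms = 0.0

      def process(self, chunk_f32: torch.Tensor) -> tuple[float, float]:
          prob = model(chunk_f32, SAMPLE_RATE).item()
          if prob < SPEECH_THRESHOLD:
              self.silence_ms += 1000.0 * CHUNK / SAMPLE_RATE
          else:
              self.silence_ms = 0.0
          return prob, self.silence_ms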

Speech confidence can come either from the VAD or from another model (for example, a transcription model or an LLM doing native audio input).
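
Wherever it comes from, it usually helps to smooth the raw per-chunk probability into a confidence signal instead of reacting to single frames. A minimal sketch, with an assumed smoothing factor:

  class SpeechConfidence:
      """Exponential moving average over per-chunk speech probabilities."""
      def __init__(self, alpha=0.3):   # alpha is an assumed smoothing factor
          self.alpha = alpha
          self.value = 0.0

      def update(self, prob: float) -> float:
          self.value = self.alpha * prob + (1 - self.alpha) * self.value
          return self.value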

Audio level should be relative to background noise, as in this code. The VAD model should already be pretty good at factoring out non-speech background noise, so the utility here is mostly speaker isolation: you want to trigger the end of speech on the loudest of any simultaneous voices. (There are, of course, specialized models just for speaker isolation. The commercial ones from Krisp are quite good.)
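
A rough sketch of "relative to background" (again, not the pipecat implementation; the adaptation rate and initial floor are assumptions): track the chunk's RMS level in dB against a slowly adapting noise-floor estimate, and use the difference.

  import numpy as np

  class RelativeLevel:
      """Chunk RMS in dBFS compared against a slowly adapting noise floor."""
      def __init__(self, floor_alpha=0.01):   # assumed adaptation rate
          self.noise_floor_db = -60.0         # assumed starting floor
          self.floor_alpha = floor_alpha

      def update(self, chunk_f32: np.ndarray) -> float:
          rms = np.sqrt(np.mean(chunk_f32 ** 2)) + 1e-10
          level_db = 20 * np.log10(rms)
          # Only let the floor adapt when the level is near or below it,
          # so speech doesn't drag the noise estimate upward.
          if level_db < self.noise_floor_db + 3:
              self.noise_floor_db += self.floor_alpha * (level_db - self.noise_floor_db)
          return level_db - self.noise_floor_db   # dB above background noise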

One interesting thing about processing audio for AI phrase endpointing is that you don't actually care about human legibility. So you don't need traditional background noise reduction, in theory. Though, in practice, the way current transcription and speech models are trained, there's a lot of overlap with audio that has been recorded for humans to listen to!
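
To make the combination concrete, here's a minimal sketch of the decision itself (not the pipecat code; the stop interval and thresholds are made-up values): only start a phrase when confidence and relative level are both up, and only declare it finished once confidence has dropped and the non-speech interval has exceeded the stop threshold.

  STOP_SECS = 0.8       # assumed: how long speech must stay absent before we endpoint
  CONFIDENCE = 0.6      # assumed speech-confidence threshold
  MIN_LEVEL_DB = 6.0    # assumed dB above the noise floor to count as the active speaker

  class PhraseEndpointer:
      def __init__(self):
          self.speaking = False

      def step(self, confidence: float, silence_ms: float, level_db: float) -> bool:
          """Returns True exactly once, when a phrase has just ended."""
          if not self.speaking:
              if confidence >= CONFIDENCE and level_db >= MIN_LEVEL_DB:
                  self.speaking = True
              return False
          if confidence < CONFIDENCE and silence_ms >= STOP_SECS * 1000:
              self.speaking = False
              return True   # hand the turn to the LLM / next pipeline stage
          return False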


Replies

com2kid · 10/02/2024

> There's a really nice implementation of phrase endpointing here:

VAD doesn't get you enough accuracy at this level. Confidence is the key bit; how that's done is what makes the experience magic!