Or you could use Soniox Real-time (supports 60 languages), which natively supports endpoint detection: the model is trained to figure out when a user's turn has ended. This always works better than VAD.
https://soniox.com/docs/stt/rt/endpoint-detection
Soniox also wins the independent benchmarks done by Daily, the company behind Pipecat.
https://www.daily.co/blog/benchmarking-stt-for-voice-agents/
You can try a demo on the home page:
Disclaimer: I used to work for Soniox.
Edit: I commented too soon. I only saw VAD and immediately thought of Soniox, which was the first service to implement real-time endpoint detection last year.
I'm using them. What has it been like working there? I see they have some consumer products as well. I wonder how they get state-of-the-art results at such low prices compared to the competition.
If you read the post, you'll see that I used Deepgram's Flux. It also does endpointing and is a higher-level abstraction than VAD.