I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice no...

frangonf • today at 1:55 PM • 1 reply • view on HN

I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice now has this feature.

My conclusions back then (which only came from a shallow research on the topic and 0 real experience mind you) was that Whisper + Pyannote was the "stable" approach.

Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?

Replies

woodson • today at 6:48 PM

It highly depends on the sort of data you’re processing (phone calls, podcasts, meetings of more people recorded using single channel?). For NVIDIA/NeMo, check out their softformer diarization models (also streaming).

alt Hacker News

Replies