I took a look into local options for ASR and diarization some months ago, I missed that VibeVoice now has this feature.
My conclusions back then (which only came from a shallow research on the topic and 0 real experience mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen or the Nemo solutions caught up in segmentation and speaker recognition?
It highly depends on the sort of data you’re processing (phone calls, podcasts, meetings of more people recorded using single channel?). For NVIDIA/NeMo, check out their softformer diarization models (also streaming).