The 60-minute single-pass transcription is the part that actually matters. Most ASR models chunk audio, and you lose speaker continuity across the boundaries. If the diarization actually holds up on hour-long recordings without drifting, that's a real solve for podcast and meeting transcription workflows.
I think we should stop calling this kind of model open source. They are really "open weight": the training code is proprietary and never revealed.
When mixing languages, why does the English have a Chinese accent and the Chinese an English accent? Is it a feature or a bug?
I think in this category, Voxtral by Mistral is a lot better. It also happens to be small enough to run on WebGPU: https://huggingface.co/spaces/mistralai/Voxtral-Realtime-Web...
Interesting story about this repo/product/author by cybersecurity researcher Kevin Beaumont: https://cyberplace.social/@GossiTheDog/116454846703138243
Isn't this the project Microsoft published and then soon pulled for security/safety reasons? What has changed since then?
Interesting to see "vibe" enshrined by the likes of Microsoft as an AI product word.
Surprised it wasn't called Copilot Voice
Great post last night from Simon: https://simonwillison.net/2026/Apr/27/vibevoice/
Still waiting for the open weights model that conclusively beats the multi-year old Whisper in accuracy, features, and performance.
Holy moly, a Microsoft AI product that isn't named Copilot!
You have selected Microsoft Sam as the computer's default voice.
Seriously, VibeVoice? Microslop really has a penchant for the worst names.
So we've really just settled on Vibe as the verb for AI then?
I've been using VibeVoice's ASR (speech-to-text) model quite intensively for the past month and have found it a lot more reliable and out-of-the-box functional than Whisper, Parakeet, and other models. The fact that it has diarization built into the model is a huge win in my book. Without that you have to run a separate model just for diarization, which adds significantly to the overall processing time, whereas VibeVoice gives you reliably great results in one pass. Big fan.
Microsoft Store App Vibing.exe Accused of Harvesting Screens, Audio, and Clipboard Data:
https://cyberpress.org/microsoft-store-app-vibing-exe-accuse...
Explains most of the shit they've been pushing with Windows 11. Perhaps all that bloatware was VibeVoiced too.
Someone tell me if this is better or worse than Parakeet
Shouldn't it be called something like "Copilot Voice" instead?
I looked into local options for ASR and diarization some months ago; I missed that VibeVoice now has this feature.
My conclusion back then (based only on shallow research and zero real experience, mind you) was that Whisper + Pyannote was the "stable" approach.
Have the VibeVoice, Voxtral, Qwen, or NeMo solutions caught up in segmentation and speaker recognition?
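For anyone who hasn't wired up the two-model approach: beyond running both models, you have to merge Whisper's timestamped segments with Pyannote's speaker turns yourself, which is part of the overhead a built-in diarizer removes. A minimal overlap-based merge might look like this (the dict shapes are illustrative, loosely modeled on the two libraries' outputs):

```python
def assign_speakers(asr_segments, speaker_turns):
    """Label each ASR segment with the speaker whose turn overlaps it most.

    asr_segments:  [{"start": float, "end": float, "text": str}, ...]
                   (e.g. Whisper-style transcription segments)
    speaker_turns: [{"start": float, "end": float, "speaker": str}, ...]
                   (e.g. Pyannote-style diarization turns)
    """
    labeled = []
    for seg in asr_segments:
        best_speaker, best_overlap = None, 0.0
        for turn in speaker_turns:
            # Length of the time interval shared by the segment and the turn.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled
```

This naive majority-overlap assignment breaks down when one ASR segment spans a speaker change, which is exactly the kind of boundary case a jointly trained model can handle better.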
Microsoft has historically made poor choices in product naming, but this has to be a new low.
In the past month or so, I added two models to my app Whisper Memos (https://whispermemos.com):
- Cohere Transcribe (self hosted)
- Grok Speech To Text (they provide an API, only $0.10/hr!)
They are both excellent. I'm not sure about this one. Would you like to see it in a consumer speech to text app?
What’s the current state of the art, for each of training locally and in the cloud, for learning my voice?
What do they mean by "frontier voice"?
It would have been better if they had provided not just weights but also some frontend where it's usable as-is.
This is a very good model, but can it be run on the web?
Maybe Microsoft’s real strength was never making the best model, it was knowing you don’t need to, as long as you own the platform everyone builds on.
For me it's giving very poor results.
Looks like this offers ASR support in GGUF: https://github.com/CrispStrobe/CrispASR -- haven't tested it.
What a terrible name
English only?
Seems quite heavy for an STT model; Parakeet and Whisper are much smaller and perform great for quick dictation and transcription of longer files. I guess that's the cost of the additional accuracy and speaker diarisation?
The TTS example clip in the repo of 'spontaneous singing' is creepy as fuck
Microsoft is famous for choosing terrible names, but how could they be this terrible?
lol they rug-pulled the 7B for our own safety some months ago
This is not a new model. It also hallucinates a lot, it's very heavy and slow at inference, and it's bad at multilingual.
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.