This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual speech.
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
Yeah, I don't get why it is suddenly getting so much attention today. It is all over Twitter too.
Saved me a lot of time, thanks!
you saved us a lot of time here.... i unstarred the repo
moving on....
You just saved me an afternoon.
I'm shocked, shocked to find that Microsoft takes credit for a slow, unoriginal product that doesn't actually do what it advertises.
It is not good for text to speech (TTS) either. I have been trying it for a few days. First of all, there is no documentation for the 1.5B model. The 0.5B realtime model is a shit model. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".
I'm really disappointed with this model, to say the least.
I think this was all covered when they said it was released by Microsoft?
It has some perks and is a bit more expressive in some cases, but overall it is trained on really noisy data, uses more memory, and isn't that fast. I'm talking about the (7b?) version that they released and then quickly removed (vibevoice-community on GitHub). I still use chatterbox turbo and sometimes qwen TTS.