It is not good for text to speech (TTS) as well. I am trying it for few days. First of all 1.5B model documentation is not there. 0.5B realtime is shit model. I was converting text, line by line and it was randomly adding music and couldn't handle special characters like "…".
I really disappointed with this model to say the least.
yep, it seems this was trained on large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed to run the TTS last week.
The 7B parameter Vibevoice TTS model is still the most impressive local TTS model i've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community maintained huggingface repos.