logoalt Hacker News

gagan2020today at 3:05 PM2 repliesview on HN

It is not good for text to speech (TTS) as well. I am trying it for few days. First of all 1.5B model documentation is not there. 0.5B realtime is shit model. I was converting text, line by line and it was randomly adding music and couldn't handle special characters like "…".

I really disappointed with this model to say the least.


Replies

Stagnanttoday at 6:40 PM

The 7B parameter Vibevoice TTS model is still the most impressive local TTS model i've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community maintained huggingface repos.

tjungbluttoday at 7:19 PM

yep, it seems this was trained on large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed to run the TTS last week.