logoalt Hacker News

TheAceOfHeartstoday at 6:20 PM1 replyview on HN

Interesting model, I've managed to get the 0.6B param model running on my old 1080 and I can generated 200 character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm, when you chain the various snippets together you really don't know what direction it's gonna go.

Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.

If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.


Replies

KaoruAoiShihotoday at 6:48 PM

Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.

show 1 reply