Hacker News

Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

348 points by sammyyyyyyy last Thursday at 8:37 PM | 120 comments

Comments

realityfactchex last Thursday at 9:44 PM

That's cool and useful.

IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).

[0] https://github.com/devnen/Chatterbox-TTS-Server

armcat yesterday at 10:16 AM

Super nice! I've been using Kokoro locally, which is 82M parameters and runs (and sounds) amazing! https://huggingface.co/hexgrad/Kokoro-82M

bcrl yesterday at 8:34 PM

What measures are being taken to ensure that this model isn't used to lower the cost of fraudsters committing grandparent scams by mimicking the voices of grandchildren?

blitzar last Thursday at 9:47 PM

Mission impossible cloning skills without the long compile time.

"The pleasure of Buzby's company is what I most enjoy. He put a tack on Miss Yancy's chair ..."

https://www.youtube.com/watch?v=H2kIN9PgvNo

https://literalminded.wordpress.com/2006/05/05/a-panphonic-p...

VerifiedReports yesterday at 5:40 AM

What is "zero-shot" supposed to mean?

yamal4321 yesterday at 3:10 AM

Tried English. There are similarities. Really impressive for such a budget. Also incredibly easy to use, thanks for this.

btbuildem last Thursday at 9:55 PM

It's impressive given the constraints!

Would you consider releasing a more capable version that renders with fewer artifacts (and maybe requires a bit more processing power)?

Chatterbox is my go-to; this could be a nice alternative were it capable of high-fidelity results!

SoftTalker last Thursday at 10:10 PM

What does "zero-shot" mean in this context?

guerrilla yesterday at 12:49 AM

I don't understand the comments here at all. I played the audio and it sounds absolutely horrible, far worse than computer voices sounded fifteen years ago. Not even the most feeble-minded person would mistake that for a human. Am I not hearing the same thing everyone else is hearing? It sounds straight-up corrupted to me. Tested in different browsers, no difference.

derefr last Thursday at 10:26 PM

Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.)

Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.

LoveMortuus yesterday at 9:38 AM

This is very cool! And it'll only get better. I do wonder if, at least as a patch-up job, they could do some light audio processing to remove the raspiness from the voices.

krunck yesterday at 2:06 AM

I just had some amusing results using text with lots of exclamations and turning up the temperature. Good fun.

woodson last Thursday at 10:36 PM

Does the 169M include the ~90M params for the Mimi codec? Interesting approach using FiLM for speaker conditioning.

convivialdingo last Thursday at 9:37 PM

Impressive! The cloning and voice affect is great. Has a slight warble in the voice on long vowels, but not a huge issue. I'll definitely check it out - we could use voice generation for alerting on one of our projects (no GPUs on hardware).

lukebechtel last Thursday at 9:27 PM

Very cool. I'd love a slightly larger version with hopefully improved voice quality.

Nice work!

elaus last Thursday at 9:57 PM

Very nice to have done this by yourself, locally.

I wish there was an open/local TTS model with voice cloning as good as 11l (for non-English languages even).

jacquesm last Thursday at 10:43 PM

What could possibly go wrong...

Don't you ever think about what the balance of good and bad is when you make something like this? What's the upside? What's the downside?

In this particular case I can only see downsides, if there are upsides I'd love to hear about them. All I see is my elderly family members getting 'me' on their phones asking for help, and falling for it.

I've gotten into the habit of waiting for the other person to speak first when I answer the phone now and the number is unknown to me.

jokethrowaway yesterday at 11:27 AM

I'm sure it has its uses, but for anything with a higher requirement for quality, I think VibeVoice is the only real OSS cloning option.

F5/E2 are also very good but have plenty of bad runs; you need to keep re-rolling until you get good outputs.

Gathering6678 yesterday at 1:54 AM

Emm...I played the sample audio and it was...horrible?

How is it voice cloning if even the sample doesn't sound like any human being...

sergiotapia last Thursday at 11:36 PM

It sounds a lot like RFK Jr! Does anyone have any more casual examples?

nunobrito last Thursday at 10:26 PM

Very cool. Now the next challenge (for me) is how to convert this to Dart and run it on Android. :-)

jokethrowaway yesterday at 11:27 AM

Sorry but the quality is too bad.

I'm sure it has its uses, but for anything practical I think VibeVoice is the only real OSS cloning option. F5/E2 are also very good but have plenty of bad runs; you need to keep re-rolling.

brikym last Thursday at 10:40 PM

A scammer's dream.
