Sopro TTS: A 169M model with zero-shot voice cloning that runs on the CPU

348 points • by sammyyyyyyy • last Thursday at 8:37 PM • 120 comments

Comments

realityfactchex • last Thursday at 9:44 PM

That's cool and useful.

IMO, the best alternative is Chatterbox-TTS-Server [0] (slower, but quite high quality).

[0] https://github.com/devnen/Chatterbox-TTS-Server

Super nice! I've been using Kokoro locally, which is 82M parameters and runs (and sounds) amazing! https://huggingface.co/hexgrad/Kokoro-82M

bcrl • yesterday at 8:34 PM

What measures are being taken to ensure that this model isn't used to lower the cost of fraudsters committing grandparent scams by mimicking the voices of grandchildren?

blitzar • last Thursday at 9:47 PM

Mission: Impossible cloning skills without the long compile time.

"The pleasure of Buzby's company is what I most enjoy. He put a tack on Miss Yancy's chair ..."

https://www.youtube.com/watch?v=H2kIN9PgvNo

https://literalminded.wordpress.com/2006/05/05/a-panphonic-p...

VerifiedReports • yesterday at 5:40 AM

What is "zero-shot" supposed to mean?

yamal4321 • yesterday at 3:10 AM

Tried English. There are similarities. Really impressive for such a budget. Also incredibly easy to use, thanks for this.

btbuildem • last Thursday at 9:55 PM

It's impressive given the constraints!

Would you consider releasing a more capable version that renders with fewer artifacts (and maybe requires a bit more processing power)?

Chatterbox is my go-to, this could be a nice alternative were it capable of high-fidelity results!

SoftTalker • last Thursday at 10:10 PM

What does "zero-shot" mean in this context?

guerrilla • yesterday at 12:49 AM

I don't understand the comments here at all. I played the audio and it sounds absolutely horrible, far worse than computer voices sounded fifteen years ago. Not even the most feeble-minded person would mistake that for a human. Am I not hearing the same thing everyone else is hearing? It sounds straight up corrupted to me. Tested in different browsers, no difference.

derefr • last Thursday at 10:26 PM

Is there yet any model like this, but which works as a "speech plus speech to speech" voice modulator — i.e. taking a fixed audio sample (the prompt), plus a continuous audio stream (the input), and transforming any speech component of the input to have the tone and timbre of the voice in the prompt, resulting in a continuous audio output stream? (Ideally, while passing through non-speech parts of the input audio stream; but those could also be handled other ways, with traditional source separation techniques, microphone arrays, etc.)

Though I suppose, for the use-case I'm thinking of (v-tubers), you don't really need the ability to dynamically change the prompt; so you could also simplify this to a continuous single-stream "speech to speech" model, which gets its target vocal timbre burned into it during an expensive (but one-time) fine-tuning step.

LoveMortuus • yesterday at 9:38 AM

This is very cool! And it'll only get better. I do wonder if, at least as a patch-up job, they could do some light audio processing to remove the raspiness from the voices.

krunck • yesterday at 2:06 AM

I just had some amusing results using text with lots of exclamations and turning up the temperature. Good fun.

woodson • last Thursday at 10:36 PM

Does the 169M include the ~90M params for the Mimi codec? Interesting approach using FiLM for speaker conditioning.

convivialdingo • last Thursday at 9:37 PM

Impressive! The cloning and voice affect is great. Has a slight warble in the voice on long vowels, but not a huge issue. I'll definitely check it out - we could use voice generation for alerting on one of our projects (no GPUs on hardware).

lukebechtel • last Thursday at 9:27 PM

Very cool. I'd love a slightly larger version with hopefully improved voice quality.

Nice work!

elaus • last Thursday at 9:57 PM

Very nice to have done this by yourself, locally.

I wish there was an open/local TTS model with voice cloning as good as 11l (for non-English languages even).

jacquesm • last Thursday at 10:43 PM

What could possibly go wrong...

Don't you ever think about what the balance of good and bad is when you make something like this? What's the upside? What's the downside?

In this particular case I can only see downsides, if there are upsides I'd love to hear about them. All I see is my elderly family members getting 'me' on their phones asking for help, and falling for it.

I've gotten into the habit of waiting for the other person to speak first when I answer the phone now and the number is unknown to me.

jokethrowaway • yesterday at 11:27 AM

I'm sure it has its uses, but for anything with a higher requirement for quality, I think Vibe Voice is the only real OSS cloning option.

F2/E5 are also very good but have plenty of bad runs; you need to keep re-rolling until you get good outputs.

Gathering6678 • yesterday at 1:54 AM

Emm...I played the sample audio and it was...horrible?

How is it voice cloning if even the sample doesn't sound like any human being...

sergiotapia • last Thursday at 11:36 PM

It sounds a lot like RFK Jr! Does anyone have any more casual examples?

nunobrito • last Thursday at 10:26 PM

Very cool. Now the next challenge (for me) is how to convert this to Dart and run it on Android. :-)

jokethrowaway • yesterday at 11:27 AM

Sorry but the quality is too bad.

I'm sure it has its uses, but for anything practical I think Vibe Voice is the only real OSS cloning option. F2/E5 are also very good but have plenty of bad runs; you need to keep re-rolling.

brikym • last Thursday at 10:40 PM

A scammer's dream.
