Hacker News

coder543, last Thursday at 10:24 PM (5 replies)

Why wouldn’t that be one-shot voice cloning? Calling it zero-shot doesn’t really make sense to me.


Replies

ben_w, last Thursday at 10:47 PM

Zero-shot means zero-retraining, so think along the lines of "Do you need to modify the weights? Or can you keep the weights fixed and you only need to supply an example?"

As the other replies say, yes, this is a silly name.

nateb2022, last Thursday at 11:32 PM

Providing inference-time context (in this case, audio) is no different than giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model, you're simply putting the rest of the prompt into context.

If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
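
A toy sketch of that distinction, for illustration only. `ToyVoiceModel`, its single `weight`, and the `reference_pitch` stand-in for an audio clip are all hypothetical; this is not how any real TTS system or library works, just the "weights fixed vs. weights updated" idea in miniature.

```python
from dataclasses import dataclass


@dataclass
class ToyVoiceModel:
    # Hypothetical: one float stands in for all trained parameters.
    weight: float = 1.0

    def synthesize(self, reference_pitch: float) -> float:
        # Inference-time conditioning: the reference clip is just another
        # input, like extra prompt context. No parameter is modified.
        return self.weight * reference_pitch

    def fine_tune(self, reference_pitch: float, target: float, lr: float = 0.1) -> None:
        # One gradient step on squared error for a single clip.
        # This is the case where the weights actually change.
        pred = self.weight * reference_pitch
        grad = 2 * (pred - target) * reference_pitch
        self.weight -= lr * grad


model = ToyVoiceModel()
before = model.weight

# "Zero-shot" in the sense used above: supply a sample, weights stay fixed.
model.synthesize(reference_pitch=2.0)
weights_unchanged = model.weight == before

# "One-shot learning" in the sense used above: update weights on one clip.
model.fine_tune(reference_pitch=2.0, target=3.0)
weights_changed = model.weight != before
```

Under this reading, the label tracks whether the parameters move, not whether a sample was supplied.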

oofbey, last Friday at 1:04 AM

It’s nonsensical to call it “zero-shot” when a sample of the voice is provided. The term “zero-shot cloning” implies you have some representation of the voice from another domain, e.g. a text description of the voice. What they’re doing is ABSOLUTELY one-shot cloning. I don’t care if lots of TTS folks use the term this way, they’re wrong.

woodson, last Thursday at 10:39 PM

I don't disagree, but that's what people started calling it. Zero-shot doesn't make sense anyway: how would the model know what voice it should sound like, unless it's a celebrity voice or similar that is included in the training data, where specifying a name is enough?

geocar, last Thursday at 10:38 PM

So if you get your target to record (say) 1 hour of audio, that's a one-shot.

If you didn't do that (because you have 100 hours of other people talking), that's zero-shot, no?
