Hacker News

woodson, last Thursday at 10:39 PM

I don't disagree, but that's what people started calling it. Zero-shot doesn't make sense anyway, as how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name).


Replies

nateb2022, last Thursday at 11:25 PM

> Zero-shot doesn't make sense anyway, as how would the model know what voice it should sound like (unless it's a celebrity voice or similar included in the training data where it's enough to specify a name).

It makes perfect sense; you are simply confusing training samples with inference context. "Zero-shot" refers to zero gradient updates (no retraining or fine-tuning) required to handle a new class, which here means a new speaker. It does not mean "zero input information."

> how would the model know what voice it should sound like

It uses the reference audio just like a text-based model uses a prompt.

> unless it's a celebrity voice or similar included in the training data where it's enough to specify a name

If the voice is in the training data, that is literally the opposite of zero-shot. The entire point of zero-shot is that the model has never encountered the speaker before.
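
To make the distinction concrete, here is a toy sketch. It is not a real TTS system, and `FrozenTTS`, `embed_speaker`, and `synthesize` are names I made up for illustration. The only point it shows is that the weights stay fixed while the reference clip is passed in at inference time, the same way a prompt would be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenTTS:
    """Toy stand-in for a pretrained model (hypothetical, for illustration only)."""
    weights: tuple  # pretend these came from training; they are never updated below

    def embed_speaker(self, reference_audio):
        # Summarize the reference clip as a tiny "voice" vector: (mean, peak amplitude).
        mean = sum(reference_audio) / len(reference_audio)
        peak = max(abs(x) for x in reference_audio)
        return mean, peak

    def synthesize(self, text, reference_audio):
        # Inference only: weights are read, never written. No gradient step anywhere.
        mean, peak = self.embed_speaker(reference_audio)
        w0, w1 = self.weights
        return [w0 * ord(ch) * peak + w1 * mean for ch in text]


model = FrozenTTS(weights=(0.01, 1.0))
weights_before = model.weights

# Two speakers the model has never seen, supplied purely as inference context.
speaker_a = [0.1, -0.2, 0.3, -0.1]
speaker_b = [0.5, -0.6, 0.7, -0.4]

out_a = model.synthesize("hi", speaker_a)
out_b = model.synthesize("hi", speaker_b)
```

Same text, same frozen weights, different reference clip, different output. A fine-tuning approach would instead change `weights` for each new speaker.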
