logoalt Hacker News

woodsonlast Thursday at 11:36 PM1 replyview on HN

From your definition:

> a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to.

That's not what happens in zero-shot voice cloning, which is why I dismissed your definition copied from Wikipedia.


Replies

nateb2022last Thursday at 11:46 PM

> That's not what happens in zero-shot voice cloning

It is exactly what happens. You are confusing the task (classification vs. generation) with the learning paradigm (zero-shot).

In the voice cloning context, the class is the speaker's voice (not observed during training), samples of which are generated by the machine learning model.

The definition applies 1:1. During inference, it is predicting the conditional probability distribution of audio samples that belong to that unseen class. It is "predict[ing] the class that they belong to," which very same class was "not observed during training."

You're getting hung up on the semantics.

show 1 reply