> Zero-shot learning (ZSL) is a problem setup in deep learning where, at test time, a learner observes samples from classes which were not observed during training, and needs to predict the class that they belong to. The name is a play on words based on the earlier concept of one-shot learning, in which classification can be learned from only one, or a few, examples.
https://en.wikipedia.org/wiki/Zero-shot_learning
edit: since there seems to be some degree of confusion regarding this definition, I'll break it down more simply:
We are modeling the conditional probability P(Audio|Voice). If the model samples from this distribution for a Voice class not observed during training, it is by definition zero-shot.
"Prediction" here is not a simple classification, but the estimation of this conditional probability distribution for a Voice class not observed during training.
Providing reference audio to a model at inference time is no different from including an AGENTS.md when interacting with an LLM. You're providing context, not updating the model weights.
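To make that concrete, here's a minimal sketch of what zero-shot inference looks like. Every name here (`SpeakerEncoder`, `zero_shot_sample`, the `decoder` callable) is hypothetical, not any particular library's API; the point is only that the reference audio becomes conditioning context while the weights stay frozen:

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Toy stand-in: maps reference audio (mel frames) to a voice embedding."""
    def __init__(self, n_mels=80, d_voice=64):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_voice)

    def forward(self, mel):            # mel: (frames, n_mels)
        return self.proj(mel).mean(0)  # average over time -> (d_voice,)

@torch.no_grad()  # inference only: weights are frozen, nothing is "learned"
def zero_shot_sample(decoder, speaker_encoder, reference_mel, text_ids):
    # Encode the unseen speaker's reference audio into a conditioning vector.
    voice = speaker_encoder(reference_mel)
    # Sample from P(Audio | Text, Voice) for a Voice class never seen in
    # training; `voice` is context, not a weight update.
    return decoder(text_ids, voice)

# Toy usage with a dummy decoder, just to show nothing is trained here.
enc = SpeakerEncoder()
ref = torch.randn(200, 80)  # 200 mel frames of reference audio
dummy_decoder = lambda text_ids, voice: torch.randn(len(text_ids), 80) + voice.mean()
audio = zero_shot_sample(dummy_decoder, enc, ref, text_ids=torch.arange(10))
```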
I think the point is that it's not zero-shot if a sample is needed. A system that requires one sample is usually considered one-shot, or few-shot if it needs a few, and so on.
This generic answer from Wikipedia is not very helpful in this context. Zero-shot voice cloning in TTS usually means that audio from the target speaker (the one the generated speech should sound like) does not need to be included in the data used to train the TTS model. In other words, you provide an audio sample of the target speaker together with the text to be spoken, and the model generates audio that sounds like that speaker.
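For example, this is roughly how it looks with Coqui TTS's XTTS model. The model name and call signature below are from my recollection of the Coqui docs, so treat them as assumptions and check the current documentation:

```python
from TTS.api import TTS

# Load a multi-speaker model; the target speaker was not in its training data.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# One reference clip of the unseen speaker plus the text to be spoken is
# enough; there is no fine-tuning or weight update on the target voice.
tts.tts_to_file(
    text="Hello, this should sound like the reference speaker.",
    speaker_wav="reference_speaker.wav",  # audio sample of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```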