> So if you get your target to record (say) 1 hour of audio, that's a one-shot.
No, that would still be zero-shot. Providing inference-time context (in this case, audio) is no different from giving a prompt to an LLM. Think of it as analogous to an AGENTS.md included in a prompt. You're not retraining the model; you're simply putting the rest of the prompt into context.
If you actually stopped and fine-tuned the model weights on that single clip, that would be one-shot learning.
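To make the distinction concrete, here's a toy sketch. It is not a real voice model: `ToyModel`, `generate`, and `fine_tune` are made-up names for illustration, and the "reference clip" is just a number. The point is only that conditioning on a reference leaves the weights alone, while a fine-tuning step changes them.

```python
class ToyModel:
    """Hypothetical stand-in for a voice model: one learned weight."""

    def __init__(self, weight: float = 0.5):
        self.weight = weight  # the only learned parameter

    def generate(self, text_feature: float, reference: float = 0.0) -> float:
        # The reference is just another input at inference time.
        # It is read, but nothing is learned from it.
        return self.weight * text_feature + reference

    def fine_tune(self, text_feature: float, target: float, lr: float = 0.1) -> None:
        # One gradient step on squared error. This updates the weight.
        pred = self.weight * text_feature
        grad = 2 * (pred - target) * text_feature
        self.weight -= lr * grad


model = ToyModel()
weight_before = model.weight

# Inference-time context: output depends on the reference, weights do not move.
out_plain = model.generate(1.0)
out_conditioned = model.generate(1.0, reference=0.3)
weight_after_inference = model.weight

# Training on a single example: the weights themselves change.
model.fine_tune(1.0, target=1.0)
weight_after_training = model.weight
```

In this toy, `weight_after_inference` equals `weight_before` no matter how long the reference is, whereas `weight_after_training` differs after a single example.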