> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.
How does that work? Correlating sound with movement?
Think about it conceptually:
Could you watch a music video and say "that's the snare drum, that's the lead singer, that's the keyboard, that's the bass, that's the truck making the engine noise, that's the crowd cheering, oh and that's a jackhammer in the background"? So can AI.
Could you point out who is lead guitar and who is rhythm guitar? So can AI.
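Mechanically, one plausible shape for this (my sketch, nothing confirmed about the actual model): embed the clicked region of the video, then use that visual embedding to condition an audio separation network, e.g. via feature-wise modulation. All names and shapes below are hypothetical stand-ins.

```python
# Hypothetical sketch: a click on the video selects a visual embedding,
# which conditions the audio separator (FiLM-style scale/shift).
import torch
import torch.nn as nn

class VisuallyPromptedSeparator(nn.Module):
    def __init__(self, audio_ch: int = 64, visual_dim: int = 512):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, audio_ch, kernel_size=9, padding=4)
        # Map the visual prompt embedding to per-channel scale and shift.
        self.film = nn.Linear(visual_dim, 2 * audio_ch)
        self.mask_head = nn.Conv1d(audio_ch, 1, kernel_size=9, padding=4)

    def forward(self, mixture: torch.Tensor, visual_query: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples); visual_query: (batch, visual_dim),
        # e.g. an image-encoder feature pooled around the clicked object.
        h = torch.relu(self.audio_enc(mixture.unsqueeze(1)))
        scale, shift = self.film(visual_query).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # condition on the click
        mask = torch.sigmoid(self.mask_head(h)).squeeze(1)
        return mask * mixture  # the audio attributed to the clicked object
```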
If it’s anything like the original SAM, the training data took thousands of hours of annotator time.
If I had to do it synthetically: take clips of single subjects each making a single sound, mix them together, then train a model to separate them again (rough sketch below).
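Here's a minimal sketch of that mix-and-separate idea, assuming a PyTorch setup. The tiny network and the random stand-in clips are mine for illustration, not anything from the actual system; the loss is permutation-invariant since the model has no canonical ordering of sources.

```python
# Mix-and-separate sketch (hypothetical): sum isolated single-source clips
# into a synthetic mixture, train a network to recover the originals.
import itertools

import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    """Toy stand-in for a real separation network: maps a mono mixture
    to n_sources estimated waveforms via sigmoid masks."""
    def __init__(self, n_sources: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, n_sources, kernel_size=9, padding=4),
        )

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples) -> estimates: (batch, n_sources, samples)
        masks = torch.sigmoid(self.net(mixture.unsqueeze(1)))
        return masks * mixture.unsqueeze(1)

def pit_loss(estimates: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant MSE: score every assignment of estimated
    sources to ground-truth sources, keep the best per batch item."""
    n = targets.shape[1]
    losses = []
    for perm in itertools.permutations(range(n)):
        per_item = ((estimates[:, list(perm)] - targets) ** 2).mean(dim=(1, 2))
        losses.append(per_item)
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

model = TinySeparator(n_sources=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    sources = torch.randn(8, 2, 16000)  # stand-in for isolated 1 s clips
    mixture = sources.sum(dim=1)        # combine them into one track
    loss = pit_loss(model(mixture), sources)  # train to pull them apart
    opt.zero_grad(); loss.backward(); opt.step()
```

The nice property of this recipe is that labels are free: every mixture comes with its ground-truth stems by construction, so no annotators are needed for the separation objective itself.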