> Visual prompting: Click on the person or object in the video that’s making a sound to isolate their audio.
How does that work? Correlating sound with movement?
Think about it conceptually:
Could you watch a music video and say "that's the snare drum, that's the lead singer, that's the keyboard, that's the bass, that's the truck making the engine noise, that's the crowd cheering, oh and that's a jackhammer in the background"? So can AI.
Could you point out who is lead guitar and who is rhythm guitar? So can AI.
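Mechanically, one plausible shape for this (my sketch, nothing confirmed about the actual model): embed the clicked region of the video, then use that visual embedding to condition an audio separation network, e.g. via feature-wise modulation. All names and shapes below are hypothetical stand-ins.

```python
# Hypothetical sketch: a click on the video selects a visual embedding,
# which conditions the audio separator (FiLM-style scale/shift).
import torch
import torch.nn as nn

class VisuallyPromptedSeparator(nn.Module):
    def __init__(self, audio_ch: int = 64, visual_dim: int = 512):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, audio_ch, kernel_size=9, padding=4)
        # Map the visual prompt embedding to per-channel scale and shift.
        self.film = nn.Linear(visual_dim, 2 * audio_ch)
        self.mask_head = nn.Conv1d(audio_ch, 1, kernel_size=9, padding=4)

    def forward(self, mixture: torch.Tensor, visual_query: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples); visual_query: (batch, visual_dim),
        # e.g. an image-encoder feature pooled around the clicked object.
        h = torch.relu(self.audio_enc(mixture.unsqueeze(1)))
        scale, shift = self.film(visual_query).chunk(2, dim=-1)
        h = h * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # condition on the click
        mask = torch.sigmoid(self.mask_head(h)).squeeze(1)
        return mask * mixture  # the audio attributed to the clicked object
```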
If it’s anything like the original SAM, the training data took thousands of hours of annotator time.
If I had to do it synthetically: take clips of single subjects each making a single sound, mix them together, then train a model to separate them again (rough sketch below).
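Here's a minimal sketch of that mix-and-separate idea, assuming a PyTorch setup. The tiny network and the random stand-in clips are mine for illustration, not anything from the actual system; the loss is permutation-invariant since the model has no canonical ordering of sources.

```python
# Mix-and-separate sketch (hypothetical): sum isolated single-source clips
# into a synthetic mixture, train a network to recover the originals.
import itertools

import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    """Toy stand-in for a real separation network: maps a mono mixture
    to n_sources estimated waveforms via sigmoid masks."""
    def __init__(self, n_sources: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, n_sources, kernel_size=9, padding=4),
        )

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, samples) -> estimates: (batch, n_sources, samples)
        masks = torch.sigmoid(self.net(mixture.unsqueeze(1)))
        return masks * mixture.unsqueeze(1)

def pit_loss(estimates: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant MSE: score every assignment of estimated
    sources to ground-truth sources, keep the best per batch item."""
    n = targets.shape[1]
    losses = []
    for perm in itertools.permutations(range(n)):
        per_item = ((estimates[:, list(perm)] - targets) ** 2).mean(dim=(1, 2))
        losses.append(per_item)
    return torch.stack(losses, dim=1).min(dim=1).values.mean()

model = TinySeparator(n_sources=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    sources = torch.randn(8, 2, 16000)  # stand-in for isolated 1 s clips
    mixture = sources.sum(dim=1)        # combine them into one track
    loss = pit_loss(model(mixture), sources)  # train to pull them apart
    opt.zero_grad(); loss.backward(); opt.step()
```

The nice property of this recipe is that labels are free: every mixture comes with its ground-truth stems by construction, so no annotators are needed for the separation objective itself.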