Hacker News

SpaceManNabs today at 5:25 PM

> No transcription, no frame captioning, no intermediate text.

If there is text in the video (like a caption or whatever), will the embedding capture that? I never thought about this before.

If the video has audio, does the embedding capture that too?


Replies

sohamrj today at 5:33 PM

Yes to both. The embedding is computed over the raw video frames, so anything visible (text, signs, captions) gets captured in the vector. Gemini Embedding 2 also extracts the audio track and embeds it alongside the visual frames, so a query like 'someone yelling' should in theory match on audio. My dashcam footage doesn't have audio though, so I haven't tested that side yet.
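
To make the matching step concrete, here is a minimal sketch of how a text query could be ranked against clip embeddings by cosine similarity. It does not call any embedding API: the three-dimensional vectors and the clip names are made-up placeholders standing in for real embeddings, and `cosine_similarity` and `rank_clips` are hypothetical helpers, not part of any library or of the project discussed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_clips(query_vec, clip_vecs):
    """Return (clip_name, score) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in clip_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors (assumption): real embeddings would come from the
# embedding model and have far more dimensions.
clip_vecs = {
    "clip_stop_sign": [0.9, 0.1, 0.0],
    "clip_yelling": [0.1, 0.2, 0.9],
    "clip_empty_road": [0.3, 0.9, 0.1],
}

# Placeholder for the embedding of the text query 'someone yelling'.
query_vec = [0.0, 0.1, 1.0]

ranking = rank_clips(query_vec, clip_vecs)
best_clip = ranking[0][0]
print(best_clip)
```

If visible text and audio both end up in the same vector, as described above, then one similarity score per clip covers all of them and no separate transcript or caption index is needed.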