It’s not difficult to hack this together with CLIP. I did this with about a tenth of my movie collection last week with a GTX 1080 - though it lacks temporal understanding so you have to do the scene analysis yourself
Would you be willing to share more details of what you did?
I'm guessing you're not storing the CLIP for every single frame, instead of every second or so? Also, are you using the cosine similarity? How are you finding the nearest vector?