This is not local, but Gemini models can process very long videos and provide descriptions with timestamps if asked.
https://ai.google.dev/gemini-api/docs/video-understanding#tr...
Nor would it describe things as they happen; it needs pre-processing instead, so in the end it's very different :)