Yes, this is possible, and there are benchmarks that test multimodal models on exactly these abilities. Context length is the main limitation, but a longer video can be processed in small chunks whose per-chunk descriptions are then composed into a description of the larger scene (a minimal sketch of that approach follows the links below).
https://github.com/JUNJIE99/MLVU
https://huggingface.co/datasets/OpenGVLab/MVBench
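Here's a rough sketch of the chunking idea: sample frames from consecutive clips, caption each clip with a VLM, then hand the timestamped captions to a text model to compose a scene-level summary. Note that `describe_clip` and `summarize` are hypothetical stand-ins for whatever VLM/LLM calls you actually use.

```python
# Sketch: chunk a long video, caption each chunk, compose a summary.
# describe_clip() and summarize() are hypothetical stand-ins for your
# own VLM / LLM calls; swap in whatever model API you use.
import cv2  # pip install opencv-python

def sample_chunks(path: str, chunk_seconds: float = 10.0, fps: float = 1.0):
    """Yield (start_time, frames) per chunk, sampling `fps` frames/second."""
    cap = cv2.VideoCapture(path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(native_fps / fps))   # keep every `step`-th frame
    per_chunk = int(chunk_seconds * fps)     # sampled frames per chunk
    frames, start, idx = [], 0.0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
            if len(frames) == per_chunk:
                yield start, frames
                start += chunk_seconds
                frames = []
        idx += 1
    if frames:
        yield start, frames
    cap.release()

def describe_video(path: str) -> str:
    captions = []
    for start, frames in sample_chunks(path):
        text = describe_clip(frames)          # hypothetical VLM call
        captions.append(f"[{start:.0f}s] {text}")
    # Compose the chunk captions into one scene-level description.
    return summarize("\n".join(captions))     # hypothetical LLM call
```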
Ovis and Qwen3-VL are examples of models that can ingest multiple frames from a video at once, which gives them both visual and temporal understanding. A usage sketch follows.
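For instance, Qwen's VL models accept a set of frames as a single "video" input through the transformers processor, so the model attends across time as well as within each frame. The sketch below follows the published Qwen2-VL usage pattern (the class name and the `qwen_vl_utils` helper come from that release; Qwen3-VL's API may differ), so treat it as an illustration rather than a drop-in recipe.

```python
# Sketch of multi-frame video inference, following the published
# Qwen2-VL transformers example; Qwen3-VL's classes/helpers may differ.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# All sampled frames go in together, so answers can span the whole clip.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
        {"type": "text", "text": "What happens in this clip, in order?"},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```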