I think one of the biggest problems is that the models are trained on 2D sequences and don't have any understanding of what they're actually seeing. They see a structure of pixels shift between frames and learn that certain 2D structures should shift over time. They don't understand that the images are 2D captures of an event that occurred in four dimensions, or that the thing being imaged is under the influence of forces that were never imaged.
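A toy sketch of that gap (entirely hypothetical, not how any real video model works): a falling object's frames are generated by gravity, but a "model" that only learns the average frame-to-frame pixel shift predicts constant motion and drifts further from reality every frame. The constants `G` and `DT` and the linear-shift predictor are illustrative assumptions.

```python
# Hypothetical illustration: frames of a dropped object are generated by
# physics, but a sequence model that only captures the mean 2D shift
# between frames has no concept of acceleration.

G = 9.8   # gravitational acceleration, m/s^2 (assumed for the toy example)
DT = 0.1  # time between frames, s

def true_positions(y0, n_frames):
    """Heights of an object dropped from y0, sampled every DT seconds."""
    return [y0 - 0.5 * G * (i * DT) ** 2 for i in range(n_frames)]

def mean_shift(frames):
    """All a purely frame-to-frame model learns here: the average shift."""
    shifts = [b - a for a, b in zip(frames, frames[1:])]
    return sum(shifts) / len(shifts)

train = true_positions(y0=100.0, n_frames=5)   # the "training" frames
shift = mean_shift(train)

# Extrapolate the next 5 frames with the learned constant shift...
predicted = [train[-1] + shift * k for k in range(1, 6)]
# ...and compare with what gravity actually does.
actual = true_positions(y0=100.0, n_frames=10)[5:]

for p, a in zip(predicted, actual):
    print(f"predicted {p:7.3f}  actual {a:7.3f}  error {p - a:+.3f}")
```

The error grows every frame because the object keeps accelerating while the shift model keeps predicting the same displacement: a statistical approximation of falling, not an understanding of it.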
I saw a dancing Santa video today, and the suspension of disbelief was dispelled almost instantly when the cuffs of his jacket moved erratically. The GenAI was trying to make them sway with his arm movements, but because it didn't understand why they would sway, it just generated a statistical approximation of swaying.
GenAI also clearly doesn't understand 3D structure, as its frequently incorrect morphological features demonstrate. Even my dogs understand gravity: if I drop an object they're tracking (food), they know it should hit the ground. They also understand 3D space: if they stand on their back legs, they can see over things or get a better perspective.
I've yet to see any GenAI that demonstrates even my dogs' level of understanding of the physical world. This leaves its output in the uncanny valley.