This is correct, and even image generation models aren't really trained to understand image composition yet.
Even the models trained on Danbooru and e621 tags still aren't the best at that, and us furries like to tag art in detail.
The best we can really do at the moment is regional prompting. Perhaps they need something similar for video.