Furthermore, I think we care most about the context surrounding the humans.
If a txt2vid model could generate a perfect video of a soccer match, flawlessly rendering each blade of grass, would anyone watch it? No, because we care about the team and the stories of the players, not just the spectacle being shown.