> if you wanted imagination you don't need to make a video model. You probably don't need to decode the latents at all.
Soft disagree. What is the purpose of that imagination if not to map it to actual real-world outcomes? To compare what it imagines against the real world, and possibly backpropagate through that comparison, you'll need video frames.