Tangential question: what do you call transformer-based models that generate images or videos? Are they LLMs? They're not really "language" models, but there isn't an easy term for them. Maybe "image models" and "video models"?