> Multimodal models aren't really multimodal. The images are mapped to words and then the wo...

embedding-shape • yesterday at 12:24 AM • 0 replies • view on HN

> Multimodal models aren't really multimodal. The images are mapped to words and then the words are expanded upon by a single mode LLM.

I don't think you can generalize like that, it's a big category, not all multimodal models work the same, it's just a label for a model that has multiple modalities after all, not a specific architecture of machine learning models.

alt Hacker News