Multimodal models aren't really multimodal. The images are mapped to words, and then the words are expanded upon by a single-mode LLM.
If you didn't know the word "duck", you could still see the duck, hunt the duck, use the duck's feathers for your bedding, and eat the duck's meat. You would know it could fly and swim without having to know what either of those actions was called.
The LLM "sees" a thing, identifies it as a "duck", and then depends on a single modal LLM to tell it anything about ducks.
> Multimodal models aren't really multimodal. The images are mapped to words, and then the words are expanded upon by a single-mode LLM.
I don't think you can generalize like that. It's a big category, and not all multimodal models work the same way; "multimodal" is just a label for a model that handles multiple modalities, not a specific machine learning architecture.
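For what it's worth, many current vision-language models don't turn the image into words at all: a vision encoder produces continuous patch embeddings, which are projected into the LLM's embedding space and consumed right alongside the text token embeddings (LLaVA-style designs work roughly this way). Here's a minimal PyTorch sketch of that idea; the dimensions and names are made up for illustration, not taken from any specific model:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
VISION_DIM = 1024   # output width of a vision encoder (e.g. a ViT)
LLM_DIM = 4096      # hidden size of the language model
NUM_PATCHES = 256   # image patches produced by the vision encoder

class VisionToLLMProjector(nn.Module):
    """Maps continuous image-patch embeddings into the LLM's embedding
    space. No words or discrete tokens are produced here; the image
    enters the LLM as soft embeddings, not as a caption."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VISION_DIM, LLM_DIM)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (batch, NUM_PATCHES, VISION_DIM)
        return self.proj(patch_embeddings)  # -> (batch, NUM_PATCHES, LLM_DIM)

# Toy usage: projected image features and embedded text tokens are
# concatenated along the sequence dimension and fed to one transformer.
image_feats = torch.randn(1, NUM_PATCHES, VISION_DIM)  # stand-in for vision encoder output
text_embeds = torch.randn(1, 32, LLM_DIM)              # stand-in for embedded prompt tokens

projector = VisionToLLMProjector()
llm_input = torch.cat([projector(image_feats), text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```

So the "image → words → LLM" pipeline describes some systems (captioner plus text model), but in architectures like the sketch above the LLM attends directly to image-derived embeddings rather than to a verbal description of the image.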