Hacker News

heyjamesknight · yesterday at 10:41 PM

Multimodal models aren't really multimodal. The images are mapped to words, and then the words are expanded upon by a single-mode LLM.

If you didn't know the word "duck", you could still see the duck, hunt the duck, use the duck's feathers for your bedding, and eat the duck's meat. You would know it could fly and swim without having to know what either of those actions was called.

The LLM "sees" a thing, identifies it as a "duck", and then depends on a single modal LLM to tell it anything about ducks.


Replies

embedding-shape · today at 12:24 AM

> Multimodal models aren't really multimodal. The images are mapped to words, and then the words are expanded upon by a single-mode LLM.

I don't think you can generalize like that. It's a big category, and not all multimodal models work the same way; "multimodal" is just a label for a model that handles multiple modalities, not a specific machine learning architecture.
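
For contrast with the pipeline above, here is a minimal PyTorch sketch of the other common design, roughly the shape of LLaVA-style architectures: the vision encoder's patch embeddings are projected directly into the LLM's token-embedding space and attended to alongside text tokens, so the image is never reduced to words first. The dimensions and tensors below are made up for illustration, not taken from any specific model.

```python
# Minimal sketch (PyTorch) of a projection-style multimodal model:
# image patch embeddings are mapped into the LLM's embedding space
# and concatenated with text embeddings. Sizes are illustrative only.
import torch
import torch.nn as nn

vision_dim, llm_dim = 768, 4096
num_patches, num_text_tokens = 256, 16

projector = nn.Linear(vision_dim, llm_dim)  # learned image-to-LLM bridge

patch_embeddings = torch.randn(1, num_patches, vision_dim)   # from a vision encoder
text_embeddings = torch.randn(1, num_text_tokens, llm_dim)   # from the LLM's embedding table

image_tokens = projector(patch_embeddings)  # "soft tokens", never words
llm_input = torch.cat([image_tokens, text_embeddings], dim=1)

# The transformer then attends over image and text tokens jointly;
# there is no intermediate captioning step.
print(llm_input.shape)  # torch.Size([1, 272, 4096])
```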