Which makes sense for text LLMs yes, but what about LLMs that deal with images? How can you tell they wouldn't work without words? It just happens to be words we use for interfacing with them, because it's easy for us to understand, but internally they might be conceptualizing things in a multitude of ways.
Multimodal models aren't really multimodal. The images are mapped to words and then the words are expanded upon by a single mode LLM.
If you didn't know the word "duck", you could still see the duck, hunt the duck, use the ducks feather's for your bedding and eat the duck's meat. You would know it could fly and swim without having to know what either of those actions were called.
The LLM "sees" a thing, identifies it as a "duck", and then depends on a single modal LLM to tell it anything about ducks.