Hacker News

D-Machine · yesterday at 10:21 PM

The multimodality of most current popular models is quite limited (mostly text is used to improve capacity in vision tasks, but the reverse is not true, except in some special cases). I made this point below at https://news.ycombinator.com/item?id=46939091

Otherwise, I don't understand the way you are using "conscious" and "unconscious" here.

My main point about conscious reasoning is that when we introspect to try to understand our thinking, we tend to see linguistic, imagistic, tactile, and various other sensory processes/representations. Some people focus only on the linguistic parts and downplay e.g. imagery (the "wordcels vs. shape rotators" meme), but in either case it is a common mistake to think the most important parts of thinking must necessarily be (1) linguistic, or (2) clearly related to what appears during introspection.


Replies

mirekrusin · yesterday at 11:00 PM

All modern models process images internally within their own neural networks; they don't delegate to a separate OCR model. Image data flows through the same paths as text, so what do you mean by "quite limited" here?

Your first comment was referring to unconscious processing; now you don't mention it.

Regarding the "conscious and linguistic" point you seem to be touching on now, setting multimodality aside: text itself is far richer for LLMs than for humans. A trivial example is a Mermaid diagram describing some complex topology, an SVG describing a complex vector graphic, or a complex program or web application. All of these are textual, but to understand and create them a model must operate in non-linguistic domains.
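A minimal sketch of that point (the Mermaid-style syntax and the `reachable` helper are illustrative, not from any model internals): the input below is pure text, but answering even a simple question about it requires reasoning over the graph topology the text describes, not over the words themselves.

```python
# A Mermaid-style edge list: pure text, but its meaning is a graph topology.
diagram = """
A --> B
B --> C
C --> A
D --> C
"""

# Parse the textual edges into an adjacency structure.
edges = {}
for line in diagram.strip().splitlines():
    src, _, dst = line.split()
    edges.setdefault(src, set()).add(dst)

def reachable(start, goal, seen=None):
    """Answering 'is goal reachable from start?' requires operating on
    the topology the text encodes, not on its linguistic surface."""
    seen = set() if seen is None else seen
    if start == goal:
        return True
    seen.add(start)
    return any(reachable(n, goal, seen)
               for n in edges.get(start, ()) if n not in seen)

print(reachable("A", "C"))  # True: A --> B --> C
print(reachable("C", "D"))  # False: no edge leads into D
```

The text is trivially simple as language; the work lies entirely in the non-linguistic structure it denotes.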

Even pure text-to-text models have the ability to operate in non-linguistic domains. But these models are not text-to-text only; they can ingest images directly as well.
