Hacker News

SoftTalker yesterday at 6:51 PM

LLMs are trained on text. Why would we expect them to understand a visual and tactile 3D world?


Replies

azinman2 yesterday at 6:53 PM

Because they’re also multimodal vLLMs.