logoalt Hacker News

FeepingCreaturetoday at 6:09 AM1 replyview on HN

That's actually how vision language models already work, pretty much.


Replies

stingraycharlestoday at 6:28 AM

Huh? The images are tokenized in the same way language is and it’s just fed into one single model. Not multiple smaller expert models.

Image gets rasterized into smaller pieces (eg 4x4 pixels) and each of those is assigned a token, similarly how text is broken up into tokens. And the whole thing is fed into a single model.

show 1 reply