Huh? Images are tokenized the same way language is, and the whole thing is fed into one single model, not multiple smaller expert models.
The image gets split into smaller patches (e.g. 4x4 pixels) and each patch is assigned a token, similarly to how text is broken up into tokens. Then the whole sequence is fed into a single model.
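To make that concrete, here's a rough numpy sketch of the patch-tokenization step (illustrative only; real vision transformers typically use larger patches like 16x16 and a learned linear projection into the same embedding space as the text tokens):

```python
import numpy as np

def image_to_patch_tokens(image: np.ndarray, patch_size: int = 4) -> np.ndarray:
    """Split an HxWxC image into flattened patch vectors ("image tokens")."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(-1, patch_size * patch_size * c)
    )
    return patches  # shape: (num_patches, patch_dim), one row per patch

# Hypothetical usage: a 32x32 RGB image becomes 64 tokens of dimension 48.
# A learned projection would then map each row into the model's embedding
# space, and the result is concatenated with the text tokens and fed into
# one transformer.
img = np.random.rand(32, 32, 3)
tokens = image_to_patch_tokens(img, patch_size=4)
print(tokens.shape)  # (64, 48)
```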
Yes, I'm saying
> Imagine face recognition to work like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".
is pretty much how it works.