Yes, I'm saying
> Imagine face recognition to work like a text chat, where the PC gets the frame from the camera and writes in the chat: "Who's that? Here's the RGB888 image in hex: ...".
that's pretty much how it works.
That isn’t a specialized model like the grandparent claimed, though; it's a single multi-modal model.
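To make the analogy concrete, here's a rough sketch of what "writing the frame into the chat" could look like. The function name `frame_to_chat_message` and the message wording are made up for illustration, and this only mirrors the quoted analogy; it isn't meant to show how any particular model actually ingests images.

```python
def frame_to_chat_message(pixels: bytes, width: int, height: int) -> str:
    """Turn a raw RGB888 frame into a chat-style text message.

    Hypothetical helper: RGB888 means 3 bytes per pixel (R, G, B),
    so a frame must hold exactly width * height * 3 bytes.
    """
    expected = width * height * 3
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    # bytes.hex() gives two lowercase hex digits per byte.
    hex_dump = bytes(pixels).hex()
    return (
        f"Who's that? Here's the {width}x{height} "
        f"RGB888 image in hex: {hex_dump}"
    )


if __name__ == "__main__":
    # A 2x1 frame: one red pixel, one green pixel.
    frame = bytes([255, 0, 0, 0, 255, 0])
    print(frame_to_chat_message(frame, 2, 1))
```

The point is just that the image ends up as part of the same input stream as the question, instead of being handed to a separate face-recognition model.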