In hindsight I may have been pedantic.

minimaxir • today at 4:34 PM • 3 replies

Replies

santiagobasulto • today at 7:29 PM

Not at all, I had the same feeling as you the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory efficient. A single matmul instead of a full ViT encoder is probably a huge win.
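
For anyone wondering what "just a linear projection" could look like in practice, here is a minimal sketch: cut the image into patches, flatten each one, and multiply by a single weight matrix. The names (`patchify`, `linear_project`), the sizes, and the random weights are all illustrative assumptions, not details from the model being discussed.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            vec = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(vec)
    return patches


def linear_project(patches, weight):
    """One matmul: (num_patches x patch_dim) @ (patch_dim x d_model)."""
    columns = list(zip(*weight))  # d_model columns, each of length patch_dim
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in columns]
        for vec in patches
    ]


# Hypothetical sizes, chosen only to keep the example small.
random.seed(0)
H = W = 8
C = 3
P = 4
D_MODEL = 16

image = [[[random.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[random.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(P * P * C)]

tokens = linear_project(patchify(image, P), weight)
print(len(tokens), len(tokens[0]))  # 4 16
```

The whole "encoder" here is one weight matrix of shape `patch_dim x d_model`, whereas a ViT encoder would stack attention and MLP layers on top of a projection like this one.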

wilkystyle • today at 4:59 PM

I had a similar thought to you, and found your question and the resulting discussion helpful!

alberto467 • today at 5:04 PM

Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some kind of encoding one way or another.


