Hacker News

minimaxir today at 4:34 PM

In hindsight I may have been pedantic.


Replies

santiagobasulto today at 7:29 PM

Not at all, I had the same feeling as you the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory-efficient. A single matmul instead of a ViT encoder is probably a huge win.
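
To make the "single matmul" point concrete, here is a rough sketch of what a linear-projection patch encoder could look like. The patch size, embedding width, and the names `patchify` and `linear_encode` are assumptions for illustration, not details taken from the work being discussed.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def linear_encode(image, weight, bias, patch):
    """The whole 'encoder': one matrix multiply plus a bias."""
    return patchify(image, patch) @ weight + bias

# Hypothetical sizes, chosen only for the example.
rng = np.random.default_rng(0)
patch, d_model = 16, 64
image = rng.random((64, 64, 3), dtype=np.float32)
weight = 0.02 * rng.standard_normal((patch * patch * 3, d_model), dtype=np.float32)
bias = np.zeros(d_model, dtype=np.float32)

tokens = linear_encode(image, weight, bias, patch)
print(tokens.shape)  # (16, 64): 16 patches, each mapped to a 64-dim embedding
```

A ViT encoder would typically start from the same kind of patch projection and then run a stack of attention and MLP layers on top of it, which is where the extra compute and memory go.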

wilkystyle today at 4:59 PM

I had a similar thought to you, and found your question and the resulting discussion helpful!

alberto467 today at 5:04 PM

Not at all. Getting really pedantic, tokenization is also a form of encoding, so whatever modality you're using, you'll end up doing some type of encoding one way or another.
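
As a toy illustration of that point, even the simplest text tokenizer is an encoder from characters to integer IDs. The byte-level scheme below is picked only because it is tiny; it is not meant to match the tokenizer of any particular model, and the function names are made up for the example.

```python
def byte_tokenize(text):
    """Encode text as a list of integer IDs, one per UTF-8 byte."""
    return list(text.encode("utf-8"))

def byte_detokenize(ids):
    """Invert byte_tokenize: turn the integer IDs back into text."""
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("hi!")
print(ids)                   # [104, 105, 33]
print(byte_detokenize(ids))  # hi!
```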
