Hacker News

fi-le · last Saturday at 11:43 AM

Good point. The architectural solution that would come to mind is 2D positional embeddings, i.e. we add sines and cosines for two position coordinates to each token embedding instead of one. Apparently people have done it before: https://arxiv.org/abs/2409.19700v2
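For concreteness, here's a minimal sketch of one way such a 2D sinusoidal encoding could look: split the embedding into a row half and a column half and encode each coordinate with the standard 1D scheme. The function names, the split, and the 3x4 grid are illustrative assumptions, not taken from the linked paper.

```python
import numpy as np

def sinusoidal_1d(positions, dim):
    """Standard 1D sinusoidal encoding for a vector of positions (dim must be even)."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = positions[:, None] * freqs[None, :]                        # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)    # (n, dim)

def sinusoidal_2d(rows, cols, dim):
    """2D variant: half the channels encode the row index, half the column index."""
    assert dim % 4 == 0, "need dim divisible by 4 for an even split"
    row_enc = sinusoidal_1d(rows, dim // 2)                              # (n, dim/2)
    col_enc = sinusoidal_1d(cols, dim // 2)                              # (n, dim/2)
    return np.concatenate([row_enc, col_enc], axis=-1)                   # (n, dim)

# Example: tokens laid out on a 3x4 grid (e.g. text with explicit line/column structure)
rows, cols = np.meshgrid(np.arange(3), np.arange(4), indexing="ij")
pe = sinusoidal_2d(rows.ravel().astype(float), cols.ravel().astype(float), dim=64)
print(pe.shape)  # (12, 64); added to token embeddings just like a 1D positional encoding
```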


Replies

ninjha · last Saturday at 12:14 PM

I think I remember one of the original ViT papers saying something about 2D embeddings on image patches not actually increasing performance on image recognition or segmentation, so it’s kind of interesting that it helps with text!

E: I found the paper: https://arxiv.org/pdf/2010.11929

> We use standard learnable 1D position embeddings, since we have not observed significant performance gains from using more advanced 2D-aware position embeddings (Appendix D.4).

Although it looks like that was just ImageNet, so maybe this isn't that surprising.
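To make the quoted distinction concrete, here's a rough sketch (not the authors' code) of the two options as I read them, assuming the ViT-Base/16 grid of 14x14 patches: a single learned table over flattened patch indices ("1D") vs. separately learned row and column halves that get concatenated ("2D-aware").

```python
import torch
import torch.nn as nn

n_rows, n_cols, dim = 14, 14, 768                          # assumed ViT-Base/16 on 224x224 input

# "1D": one learned vector per flattened patch index
pos_1d = nn.Parameter(torch.zeros(n_rows * n_cols, dim))

# "2D-aware": learned row and column embeddings of size dim/2 each, concatenated per patch
row_emb = nn.Parameter(torch.zeros(n_rows, dim // 2))
col_emb = nn.Parameter(torch.zeros(n_cols, dim // 2))
pos_2d = torch.cat([
    row_emb.repeat_interleave(n_cols, dim=0),              # (n_rows*n_cols, dim/2), row varies slowly
    col_emb.repeat(n_rows, 1),                             # (n_rows*n_cols, dim/2), column cycles fast
], dim=-1)

print(pos_1d.shape, pos_2d.shape)  # both (196, 768); added to patch embeddings before the encoder
```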
