
nighthawk454, today at 5:21 AM

Some potentially related stuff on the topic:

Anisotropy in word embeddings has been documented since at least 2017 with word2vec, where there are zero layers.

The cone-shaped anisotropy in transformers has been known since at least Gao et al. 2019. That line of work explained it fairly intuitively as an artifact of word frequency and softmax geometry (so a training dynamic).

A variety of papers followed up by adding post-hoc ‘whitening’ steps (from classical statistics/NLP), then adding regularizers to the loss to penalize the anisotropy, eventually penalizing the covariance matrix (a la VICReg), and then the SIGReg method as a computationally much cheaper way to approximate the full covariance.

As another commenter pointed out, it’s also similar to the InfoNCE/contrastive learning objectives, where terms were added to increase uniformity (spreading points out evenly) on the hypersphere, as in the SimCSE paper (Gao et al. 2021) or the excellent alignment/uniformity breakdown from Wang & Isola 2020.
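For a feel of what a uniformity term measures, here is a rough sketch in plain Python of the Gaussian-potential form from Wang & Isola: the log of the mean of exp(-t * squared distance) over pairs of unit-normalized vectors. The function names, the toy vectors, and t = 2 are my own choices for illustration, not code from any of the papers.

```python
import math

def normalize(v):
    # Project a vector onto the unit hypersphere.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def uniformity(vectors, t=2.0):
    # Log of the mean Gaussian potential over all distinct pairs.
    # Lower (more negative) means the points are spread out more evenly.
    pts = [normalize(v) for v in vectors]
    total, count = 0.0, 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j]))
            total += math.exp(-t * sq_dist)
            count += 1
    return math.log(total / count)

# Toy data (hypothetical): four points spread around the circle
# versus four points squeezed into a narrow cone.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
cone = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.05]]

print(uniformity(spread), uniformity(cone))
```

The spread-out set scores lower than the cone, which is the direction a uniformity term pushes when it is minimized.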

This proposed dispersion loss seems similar in that it pushes things apart by penalizing cosine similarity, although it works on the tokens within one sequence. Contrastive methods usually mean-pool each sequence and then contrast it against the other pooled sequences in the batch.
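To make that distinction concrete, a minimal sketch in plain Python. It is not the proposed loss itself (I have not reproduced its exact form); it only assumes a mean pairwise cosine similarity as the penalty and shows the two places it could be applied. All names and the toy batch are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    # Average cosine similarity over all distinct pairs.
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def token_level_penalty(sequence):
    # Within one sequence: compare token vectors to each other.
    return mean_pairwise_cosine(sequence)

def mean_pool(sequence):
    # Collapse a sequence of token vectors into one vector.
    dim = len(sequence[0])
    return [sum(tok[d] for tok in sequence) / len(sequence) for d in range(dim)]

def batch_level_penalty(batch):
    # Across the batch: compare one pooled vector per sequence.
    return mean_pairwise_cosine([mean_pool(seq) for seq in batch])

# Toy batch (hypothetical): two sequences of two 2-d token vectors each.
batch = [
    [[1.0, 0.0], [0.0, 1.0]],  # tokens orthogonal to each other
    [[1.0, 0.0], [1.0, 0.1]],  # tokens nearly parallel
]

print([token_level_penalty(seq) for seq in batch])
print(batch_level_penalty(batch))
```

The token-level version sees the collapse inside the second sequence directly, while the batch-level version only sees the pooled vectors and says nothing about how tokens are arranged within a sequence.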