Hacker News

cgearhart, yesterday at 1:19 AM

Why is there an expectation that “nearby” tokens are relevant to increase the information in the similarities? That seems like it would hold true within individual words, but the whole point of attention was to solve long range dependencies. Reintroducing local windows seems like a step backwards to me.
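(For readers unfamiliar with the "local windows" being discussed: a minimal NumPy sketch of a sliding-window causal attention mask, where each position can only attend to itself and the few positions immediately before it. This is a generic illustration, not the specific mechanism from the article; the function name and window size are made up for the example.)

```python
import numpy as np

def local_causal_mask(n, window):
    # Position i may attend to positions j with i - window < j <= i:
    # causal (no future tokens) and local (only `window` recent tokens).
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

# With n=6 and window=3, position 5 attends only to positions 3, 4, 5;
# full attention would let it attend to all of 0..5.
mask = local_causal_mask(6, window=3)
```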


Replies

sdenton4, yesterday at 1:26 AM

Maybe it's helpful to find the right point in the long context, but then have easy access to the local structure around that point.

E.g., yes, the magically relevant point is the third word of the fifth paragraph on page 183 of the document, but having a good representation of that whole page is more helpful than the single word.

jsenn, yesterday at 1:57 AM

This doesn’t answer your question, but one thing to keep in mind is that past the very first layer, every “token” position is a weighted average of every previous position, so adjacency in the representations isn’t necessarily tied to adjacency of the input tokens.

A borderline tautological answer might be “because the network learns that putting related things next to each other increases the usefulness of the convolutions”
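(The idea of a convolution over the similarity scores can be sketched as follows: each entry of the query-key score matrix gets blended with its neighbors along the key axis, so a score also reflects similarity to the keys immediately around that position. This is a toy illustration under assumed shapes, not the article's actual operator; `smooth_scores` and the kernel values are invented for the example.)

```python
import numpy as np

def smooth_scores(scores, kernel):
    # scores: (n, n) raw query-key similarities.
    # Convolve each row along the key axis with a short (odd-length)
    # kernel, zero-padded at the edges.
    k = len(kernel)
    pad = k // 2
    n = scores.shape[1]
    padded = np.pad(scores, ((0, 0), (pad, pad)))
    out = np.zeros_like(scores, dtype=float)
    for t, w in enumerate(kernel):
        out += w * padded[:, t:t + n]
    return out

# An identity kernel leaves the scores unchanged; a [0.25, 0.5, 0.25]
# kernel spreads each similarity peak onto its neighboring key positions.
scores = np.eye(4)
smoothed = smooth_scores(scores, [0.25, 0.5, 0.25])
```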

energy123, yesterday at 4:03 AM

It's a little more inductive bias, and that's not necessarily a step backwards. You need the right amount of inductive bias for a given data size and model capacity, no more and no less. Transformers already encode an inductive bias about temporal structure by being causal.