Hacker News

yorwba today at 5:56 PM

Treating attention as a lookup operation is popular among computational complexity theorists (e.g. https://arxiv.org/abs/2310.03817 ), because it's easier to work with when you're explicitly constructing a transformer to perform a particular computation, just to demonstrate that transformers can, in theory, perform it. That's also why there are no training details: the weights are computed directly, not trained.
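To make the "attention as lookup" framing concrete, here is a minimal sketch (my own illustration, not the linked paper's construction): with hand-set one-hot keys and a large scale factor, softmax attention behaves like an exact key-value retrieval. The names `beta` and `lookup_attention` are made up for this example.

```python
import numpy as np

def lookup_attention(query, keys, values, beta=50.0):
    # Large beta makes the softmax nearly an argmax, so attention
    # acts as an exact lookup of the value stored at the matching key.
    scores = beta * (keys @ query)        # high score only where key == query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over stored entries
    return weights @ values               # ~ the value at the matching key

keys = np.eye(4)                          # four one-hot "addresses"
values = np.array([[10.0], [20.0], [30.0], [40.0]])
query = keys[2]                           # look up address 2
out = lookup_attention(query, keys, values)
# out is approximately [30.0]
```

No gradient descent is involved: the "weights" (here, the keys) are chosen directly so the mechanism computes the intended function, which is exactly the style of construction used in such capability proofs.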


Replies

D-Machine today at 6:15 PM

This is a good link and important (albeit niche) qualification.

It is hard to square with the article's claims about differentiability, and with its general lack of clarity (bordering on obscurantism) about what is actually being done here: they really are just compiling/encoding a simple computer/VM into a slightly modified transformer, which, while cool, is really not what they make it sound like at all.