I will beat loudly on the "Attention is a reinvention of Kernel Smoothing" drum until it is common knowledge. It looks like Cosma Shalizi's fantastic website is down for now, so here's an archive link to his essential reading on this topic [0].
If you're interested in machine learning at all and not very strong on kernel methods, I highly recommend taking a deep dive. Such a huge amount of ML can be framed through the lens of kernel methods (and things like Gaussian Processes will become much easier to understand).
0. https://web.archive.org/web/20250820184917/http://bactra.org...
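To make the correspondence concrete, here is a minimal NumPy sketch (a toy example of my own, not taken from the notebook): with an exponential kernel on scaled dot products, the Nadaraya-Watson smoother and single-query softmax attention compute the same weighted average of the values.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 8, 5                          # embedding dimension, number of keys (arbitrary)
    keys   = rng.normal(size=(n, d))     # the x_i
    values = rng.normal(size=(n, d))     # the y_i
    query  = rng.normal(size=(d,))       # the point x we smooth at

    # Nadaraya-Watson: m(x) = sum_i K(x, x_i) y_i / sum_i K(x, x_i),
    # here with the kernel K(x, x_i) = exp(x . x_i / sqrt(d)).
    def nadaraya_watson(query, keys, values):
        k = np.exp(keys @ query / np.sqrt(d))
        return (k / k.sum()) @ values

    # Scaled dot-product attention for a single query (no learned projections).
    def attention(query, keys, values):
        s = keys @ query / np.sqrt(d)
        w = np.exp(s - s.max())          # softmax
        return (w / w.sum()) @ values

    print(np.allclose(nadaraya_watson(query, keys, values),
                      attention(query, keys, values)))   # True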
> Such a huge amount of ML can be framed through the lens of kernel methods
And none of them are a reinvention of kernel methods. There is such a huge gap between the Nadaraya-Watson idea and a working attention model that calling it a reinvention is quite a reach.
One might as well say that neural networks trained with gradient descent are a reinvention of numerical methods for function approximation.
The site is still fine (but it is, and always has been, http-only):
http://bactra.org/notebooks/nn-attention-and-transformers.ht...
In physics we call these things "dualities": depending on the problem, one can choose different perspectives on the subject.
Things proven in one domain can then be pulled back to the other domain along the arrows of the duality.
The archive link above is broken; here is an earlier archived copy of that page with the content intact:
https://web.archive.org/web/20230713101725/http://bactra.org...
This might be the single best blog post I've ever read, both in terms of content and style.
Y'all should read this, and make sure you read to the end. The last paragraph is priceless.
I don't understand what motivates the need for w1 and w2, unless we accept the premise that we are doing attention in the query and key spaces... which is not the author's thesis. What am I missing?
Surprisingly, reading this piece helped me better understand the query/key metaphor.
It's utterly baffling to me that there hasn't been more SOTA machine learning research on Gaussian processes with the kernels inferred via deep learning. It seems a lot more flexible than the primitive, rigid dot product attention that has come to dominate every aspect of modern AI.
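For what it's worth, this is roughly the flavour of what gets called deep kernel learning. A minimal PyTorch sketch of the idea (a toy example under my own assumptions, not any particular paper's implementation): a small network maps inputs to features, an RBF kernel on those features defines the GP, and the network weights and GP hyperparameters are fit jointly on the marginal likelihood.

    import torch

    torch.manual_seed(0)

    # Toy 1-D regression data.
    x_train = torch.linspace(-3, 3, 40).unsqueeze(-1)
    y_train = torch.sin(2 * x_train).squeeze(-1) + 0.1 * torch.randn(40)
    x_test  = torch.linspace(-3, 3, 200).unsqueeze(-1)

    # "Deep kernel": an RBF kernel on features from a small MLP,
    # k(x, x') = exp(-||f(x) - f(x')||^2 / (2 * lengthscale^2)).
    feature_net = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2)
    )
    log_lengthscale = torch.nn.Parameter(torch.zeros(()))
    log_noise       = torch.nn.Parameter(torch.tensor(-2.0))

    def kernel(a, b):
        fa, fb = feature_net(a), feature_net(b)
        return torch.exp(-0.5 * torch.cdist(fa, fb) ** 2 / torch.exp(log_lengthscale) ** 2)

    def neg_log_marginal_likelihood():
        K = kernel(x_train, x_train) + torch.exp(log_noise) * torch.eye(len(x_train))
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y_train.unsqueeze(-1), L).squeeze(-1)
        return 0.5 * y_train @ alpha + torch.log(torch.diag(L)).sum()

    # Fit the network weights and GP hyperparameters jointly.
    params = list(feature_net.parameters()) + [log_lengthscale, log_noise]
    opt = torch.optim.Adam(params, lr=0.01)
    for _ in range(500):
        opt.zero_grad()
        neg_log_marginal_likelihood().backward()
        opt.step()

    # GP posterior mean at the test points.
    with torch.no_grad():
        K = kernel(x_train, x_train) + torch.exp(log_noise) * torch.eye(len(x_train))
        mean = (kernel(x_test, x_train) @ torch.linalg.solve(K, y_train.unsqueeze(-1))).squeeze(-1)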
Oh wow, I wish I could give more than one upvote for this reference!
Yes, this needs to be linked more, you are doing a great service.
Hey, can I contact you somehow?
This is really useful, thanks. In my other (top-level) comment, I mentioned some vague dissatisfactions around how in explanations of attention the Q, K, V matrices always seem to be pulled out of a hat after being motivated in a hand-wavy metaphorical way. The kernel methods treatment looks much more mathematically general and clean - although for that reason maybe less approachable without a math background. But as a recovering applied mathematician ultimately I much prefer a "here is a general form, now let's make some clear assumptions to make it specific" to a "here's some random matrices you have to combine in a particular way by murky analogy to human attention and databases."
I'll make a note to read up on kernels some more. Do you have any other reading recommendations for doing that?
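In case it's useful, here is a small NumPy sketch of that "general form, then clear assumptions" framing (my own toy example, with names and shapes chosen arbitrarily): start from a generic kernel smoother, then assume an exponential kernel on linearly projected inputs and a linear value map, and the Q, K, V matrices fall out as the specific assumptions that recover dot-product self-attention.

    import numpy as np

    rng = np.random.default_rng(1)
    d_model, d_head, n = 16, 4, 6
    X = rng.normal(size=(n, d_model))                              # token embeddings
    W_Q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)    # "learned" projections (random here)
    W_K = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
    W_V = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

    # General form: a kernel smoother, out(q) = sum_i k(q, x_i) v(x_i) / sum_i k(q, x_i).
    def kernel_smoother(q, xs, k, v):
        w = np.array([k(q, x) for x in xs])
        return (w / w.sum()) @ np.array([v(x) for x in xs])

    # Specific assumptions: an exponential kernel on projected inputs, and a linear value map.
    k = lambda q, x: np.exp((q @ W_Q) @ (x @ W_K) / np.sqrt(d_head))
    v = lambda x: x @ W_V

    smoothed = np.stack([kernel_smoother(q, X, k, v) for q in X])

    # Standard dot-product self-attention, for comparison.
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    attn = weights @ (X @ W_V)

    print(np.allclose(smoothed, attn))   # True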