Haven't watched it yet...
...but if you have favorite resources on understanding Q & K, please drop them in the comments below...
(I've watched the Grant Sanderson/3blue1brown videos [including his excellent talk at TNG Big Tech Day '24], but Q & K still escape me).
Thank you in advance.
I think this video does a pretty good job of explaining it, starting at about 10:30: https://www.youtube.com/watch?v=S27pHKBEp30
Have you tried asking e.g. Claude to explain it to you? None of the usual resources worked for me, until I had a discussion with Claude where I could ask questions about everything that I didn't get.
Q, K, and V are a way of filtering out of the token embeddings the aspects that are relevant to the task at hand.
"He was red": maybe a color, maybe anger. The "red" token embedding carries both, but only one aspect is relevant for any particular prompt.
Implement transformers yourself (i.e., in NumPy). You'll never truly understand them by just watching videos.
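For a starting point, here is roughly what a single (untrained, single-head) transformer block looks like in NumPy. The weights are random, so it computes nothing useful; the point is stepping through the shapes and data flow.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def self_attention(X, Wq, Wk, Wv, Wo):
        # Project the same embeddings three ways, score queries against keys,
        # then mix the values by the softmaxed scores.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        return softmax(scores) @ V @ Wo

    def feed_forward(X, W1, b1, W2, b2):
        # Position-wise MLP applied to each token independently.
        return np.maximum(0, X @ W1 + b1) @ W2 + b2

    def transformer_block(X, p):
        # Pre-norm residual block: attention, then feed-forward.
        X = X + self_attention(layer_norm(X), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
        X = X + feed_forward(layer_norm(X), p["W1"], p["b1"], p["W2"], p["b2"])
        return X

    # Untrained demo: 5 tokens, 16-d model, one 16-d head, 64-d hidden layer.
    d_model, d_head, d_ff, n_tokens = 16, 16, 64, 5
    p = {
        "Wq": rng.normal(size=(d_model, d_head)) * 0.1,
        "Wk": rng.normal(size=(d_model, d_head)) * 0.1,
        "Wv": rng.normal(size=(d_model, d_head)) * 0.1,
        "Wo": rng.normal(size=(d_head, d_model)) * 0.1,
        "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
        "b1": np.zeros(d_ff),
        "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
        "b2": np.zeros(d_model),
    }
    X = rng.normal(size=(n_tokens, d_model))
    print(transformer_block(X, p).shape)   # (5, 16): one vector per token, same shape in and out

From there it's mostly bookkeeping: token/position embeddings in front, a stack of these blocks, and a linear layer plus softmax over the vocabulary at the end.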
tl;dr: recursively aggregating packing/unpacking 'if/else if' functions (statements) as keyword arguments that call/take themselves as arguments, with their own position shifting according to the number ("weights") of else-if functions/statements needed to get all the other arguments into (one of) the adequate orders. The order changes based on the language, input prompt, and context.
If I understand it all correctly.
I implemented it in HTML a while ago and might do it in htmx sometime soon.
Transformers are just slutty dictionaries that Papa Roach and kage bunshin no jutsu right away, again and again, spawning clones and variations based on requirements, which is why they tend to repeat themselves rather quickly and often. It's got almost nothing to do with the languages themselves; requirements and weights amount to playbooks and DEFCON levels.
It's just a re-invention of kernel smoothing. Cosma Shalizi has an excellent write-up on this [0].
Once you recognize this, it's a wonderful re-framing of what a transformer is doing under the hood: you're effectively learning a bunch of sophisticated kernels (through the FF part) and then applying kernel smoothing in different ways through the attention layers. It makes you realize that transformers are philosophically much closer to things like Gaussian Processes (which are also just a bunch of kernel manipulation). (Toy sketch of the correspondence below.)
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
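To make the analogy concrete, here's a toy NumPy sketch (my own illustration, not from Shalizi's notebook): Nadaraya-Watson kernel smoothing and single-head attention are both normalised-kernel weighted averages of values; attention just swaps the Gaussian kernel for exp(q . k / sqrt(d)) computed on learned projections.

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(s):
        s = s - s.max(axis=-1, keepdims=True)
        e = np.exp(s)
        return e / e.sum(axis=-1, keepdims=True)

    # Nadaraya-Watson kernel smoothing: estimate y at each query point
    # as a kernel-weighted average of the observed y's.
    def nadaraya_watson(x_query, x_obs, y_obs, bandwidth=1.0):
        k = np.exp(-((x_query[:, None] - x_obs[None, :]) ** 2) / (2 * bandwidth**2))
        w = k / k.sum(axis=1, keepdims=True)
        return w @ y_obs

    # Attention: the same shape of computation -- a normalised weighted
    # average of values -- with exp(q . k / sqrt(d)) as the "kernel".
    def attention(Q, K, V):
        w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        return w @ V

    # 1-d demo: both return one smoothed estimate per query point.
    x_obs = np.linspace(0, 6, 30)
    y_obs = np.sin(x_obs) + 0.1 * rng.normal(size=30)
    x_q = np.array([1.5, 3.0])

    print(nadaraya_watson(x_q, x_obs, y_obs, bandwidth=0.5))
    print(attention(x_q[:, None], x_obs[:, None], y_obs[:, None]).ravel())
    # Here the raw scalars stand in for Q/K (an exponential dot-product
    # kernel rather than a Gaussian one); a transformer's learned
    # W_q / W_k just make that kernel far more expressive.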