Hacker News

The Illustrated Transformer

191 points by auraham today at 7:15 PM | 42 comments

Comments

libraryofbabel today at 9:50 PM

I read this article back when I was learning the basics of transformers; the visualizations were really helpful. In retrospect, though, knowing how a transformer works wasn't very useful at all in my day job applying LLMs, except as a sort of deep background: reassurance that I had some idea of how the big black box producing the tokens was put together, and a mathematical basis for things like context-size limitations.

I would strongly caution anyone who thinks that they will be able to understand or explain LLM behavior better by studying the architecture closely. That is a trap. Big SotA models these days exhibit so many nontrivial emergent phenomena (in part due to the massive application of reinforcement learning techniques) that they have capabilities very few people expected to ever see when this architecture first arrived. Most of us confidently claimed even back in 2023 that, based on LLM architecture and training algorithms, LLMs would never be able to perform well on novel coding or mathematics tasks. We were wrong. That points towards some caution and humility about using network architecture alone to reason about how LLMs work and what they can do. You'd really need to be able to poke at the weights inside a big SotA model to even begin to answer those kinds of questions, but unfortunately that's only really possible if you're a "mechanistic interpretability" researcher at one of the major labs.

Regardless, this is a nice article, and this stuff is worth learning because it's interesting for its own sake! Right now I'm actually spending some vacation time implementing a transformer in PyTorch just to refresh my memory of it all. It's a lot of fun! If anyone else wants to get started with that, I would highly recommend Sebastian Raschka's book and YouTube videos as a way into the subject: https://github.com/rasbt/LLMs-from-scratch
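For anyone curious, this is roughly the kind of thing I mean: a minimal sketch of a single pre-norm transformer block in PyTorch (the sizes and names are placeholders I picked, not anything from the article or from Raschka's material):

    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        """A single pre-norm transformer block: self-attention plus a feed-forward net."""
        def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.GELU(),
                nn.Linear(d_ff, d_model),
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.drop = nn.Dropout(dropout)

        def forward(self, x, attn_mask=None):
            # self-attention sublayer with a residual connection
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
            x = x + self.drop(attn_out)
            # feed-forward sublayer with a residual connection
            x = x + self.drop(self.ff(self.norm2(x)))
            return x

    # toy usage: batch of 2 sequences, 16 tokens each, embedding dim 512
    x = torch.randn(2, 16, 512)
    block = TransformerBlock()
    print(block(x).shape)  # torch.Size([2, 16, 512])

Stack a pile of these, put token and position embeddings underneath and an unembedding layer on top, and that's most of a GPT-style model.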

Has anyone read TFA author Jay Alammar's book (published Oct 2024) and would they recommend it for a more up-to-date picture?

some_guy_nobel today at 11:24 PM

Great article, must be the inspiration for the recent Illustrated Evo 2: https://research.nvidia.com/labs/dbr/blog/illustrated-evo2/

boltzmann_ today at 7:41 PM

Kudos also to the Transformer Explainer team for putting together some amazing visualizations: https://poloclub.github.io/transformer-explainer/ It really clicked for me after reading these two and watching the 3blue1brown videos.

laser9 today at 7:41 PM

Here's the comment from the author himself (jayalammar) talking about other good resources on learning Transformers:

https://news.ycombinator.com/item?id=35990118

Koshkin today at 8:51 PM

(Going on a tangent.) The number of transformer explanations/tutorials is becoming overwhelming. Reminds me of monads (or maybe calculus). Someone feels a spark of enlightenment at some point (while, often, in fact, remaining deeply confused), and an urge to share their newly acquired (mis)understanding with a wide audience.

zkmon today at 9:32 PM

I think the internals of transformers will become less relevant, much like the internals of compilers: programmers will only care about how to "use" them rather than how to develop them.

gustavoaca1997 today at 7:49 PM

I have this book. It was really a lifesaver in helping me catch up a few months ago when my team decided to use LLMs in our systems.

profsummergig today at 7:28 PM

Haven't watched it yet...

...but, if you have favorite resources on understanding Q & K, please drop them in comments below...

(I've watched the Grant Sanderson/3blue1brown videos [including his excellent talk at TNG Big Tech Day '24], but Q & K still escape me).

Thank you in advance.

ActorNightly today at 8:56 PM

People need to get away from the idea that Key/Query/Value are something special.

Whereas a standard deep layer in a network computes matrix * input, where each row of the matrix holds the weights of a particular neuron in the next layer, a transformer layer basically computes input*MatrixA, input*MatrixB, and input*MatrixC (each an input-times-weight-matrix product, so each result is itself a matrix), and then combines those three products to form the output. It's just more dimensions in a layer.
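To make that concrete, here's a minimal single-head attention sketch in plain NumPy (the names and sizes are purely illustrative); apart from the softmax, it really is just the same input multiplied by three learned matrices:

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 5, 16, 8

    X = rng.normal(size=(seq_len, d_model))    # one token embedding per row
    W_q = rng.normal(size=(d_model, d_head))   # three independent projection matrices
    W_k = rng.normal(size=(d_model, d_head))   #   (the MatrixA/B/C above)
    W_v = rng.normal(size=(d_model, d_head))

    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values: plain matrix products
    scores = Q @ K.T / np.sqrt(d_head)         # how strongly each token attends to every other
    out = softmax(scores) @ V                  # output: a weighted mix of the value rows
    print(out.shape)                           # (5, 8)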

And consequently, you can represent the entire transformer architecture as a set of deep layers once you unroll the matrices, with a lot of zeros for the multiplication pieces that are not needed.

This is a fairly involved blog post, but it shows that it's just matrix multiplication all the way down: https://pytorch.org/blog/inside-the-matrix/
