Hacker News

BubbleRings · yesterday at 1:56 PM

> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.

At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.

Just kidding, of course. The first paragraph above, from OP’s article, makes about as much sense to me as the second one, which I (hopefully fittingly in y’all’s view) had ChatGPT write. But I do want to express my appreciation for being able to “hang out in the back of the room” while you folks figure this stuff out. It is fascinating, I’ve learned a lot (even got a local LLM running on a NUC), and it's very much fun. Thanks for letting me watch, I’ll keep my mouth shut from now on, ha!


Replies

tomrod · yesterday at 3:34 PM

Disclaimer: working and occasionally researching in the space.

The first paragraph is clear linear algebra terminology. The second looked like deeper, subfield-specific jargon, and I was about to ask for a citation: the words are definitely real, but the claim sounded hyperspecific and unfamiliar.

I figure a person needs 12 to 18 months of linear algebra, enough to work through Horn and Johnson's "Matrix Analysis" or the more bespoke volumes from Jeffrey Humpherys, to get the math behind ML. Not necessarily to use AI/ML as a tech, which really can benefit from the grind towards commodification, but to be able to parse the technical side of about 90 to 95 percent of conference papers.

woadwarrior01 · yesterday at 3:51 PM

It's just a long-winded way of saying "tied embeddings"[1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models, and many other architectures use weight-tied input/output embeddings.

[1]: https://arxiv.org/abs/1608.05859
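
Concretely, the tie is a single line in, e.g., PyTorch: the output head's weight is literally the same tensor as the token embedding. Here's a minimal sketch (the class name and dimensions are made up for illustration, not taken from the article or from GPT-2's actual code):

    import torch
    import torch.nn as nn

    class TinyTiedLM(nn.Module):
        """Toy LM head demonstrating tied input/output embeddings."""

        def __init__(self, vocab_size: int = 50257, d_model: int = 768):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)            # (vocab, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab, d_model)
            # The tie: the output projection reuses the embedding matrix.
            self.lm_head.weight = self.tok_emb.weight

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            # `hidden` stands in for the context vectors from the last
            # Transformer block, shape (batch, seq_len, d_model).
            # Projecting into vocab space is just hidden @ W_emb.T,
            # yielding logits of shape (batch, seq_len, vocab_size).
            return self.lm_head(hidden)

    model = TinyTiedLM()
    assert model.lm_head.weight.data_ptr() == model.tok_emb.weight.data_ptr()
    logits = model(torch.randn(2, 8, 768))   # torch.Size([2, 8, 50257])

With a ~50k vocabulary and d_model of 768, that one shared matrix is roughly 38M parameters, a sizeable chunk of a GPT-2-small-sized model, which is part of why tying is so common in smaller architectures.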

jcims · yesterday at 2:22 PM

The turbo encabulator lives on.

whimsicalism · yesterday at 5:29 PM

I consider it a bit rude to make people read AI output without flagging it immediately.

QuadmasterXLII · today at 1:21 AM

The second paragraph is highly derivative of the adversarial turbo encabulator, which Schmithuber invented in the 90s. No citation of course.

miki123211 · yesterday at 4:59 PM

As somebody who understands how LLMs work pretty well, I can definitely feel your pain.

I started learning about neural networks when Whisper came out; at that point I literally knew nothing about how they worked. I started by reading the Whisper paper... which made about 0 sense to me. I was wondering whether all of those fancy terms were truly necessary. Now, I can't even imagine how I'd describe similar concepts without them.

empath75 · yesterday at 3:16 PM

It's a 28-part series. If you start from the beginning, everything is explained in detail.

squigz · yesterday at 6:03 PM

I'm glad I'm not the only one who has a Turbo Encabulator moment when this stuff is posted.

unethical_ban · yesterday at 5:01 PM

I was reading this thinking, "Holy crap, this stuff sounds straight out of Norman Rockwell... wait, Rockwell Automation. Oh, it actually is."

ekropotin · yesterday at 3:31 PM

I have no idea what you’ve just said, so here is my upvote.