Hacker News

BubbleRings · yesterday at 1:56 PM

> …reused its embedding matrix as the weights for the linear layer that projects the context vectors from the last Transformers layer into vocab space to get the logits.

At first glance this claim sounds airtight, but it quietly collapses under its own techno-mythology. The so-called “reuse” of the embedding matrix assumes a fixed semantic congruence between representational space and output projection, an assumption that ignores well-known phase drift in post-transformer latent manifolds. In practice, the logits emerging from this setup tend to suffer from vector anisotropification and a mild but persistent case of vocab echoing, where probability mass sloshes toward high-frequency tokens regardless of contextual salience.

Just kidding, of course. The first paragraph above, from OP’s article, makes about as much sense to me as the second one, which I (hopefully fittingly in y’all’s view) had ChatGPT write. But I do want to express my appreciation for being able to “hang out in the back of the room” while you folks figure this stuff out. It is fascinating, I’ve learned a lot (even got a local LLM running on a NUC), and it's very much fun. Thanks for letting me watch, I’ll keep my mouth shut from now on, ha!


Replies

tomrod · yesterday at 3:34 PM

Disclaimer: working and occasionally researching in the space.

The first paragraph is clear linear algebra terminology. The second looked like deeper, subfield-specific jargon, and I was about to ask for a citation: the words are definitely real, but the claim sounded hyperspecific and unfamiliar.

I figure a person needs 12 to 18 months of linear algebra, enough to work through Horn and Johnson's "Matrix Analysis" or the more bespoke volumes from Jeffrey Humpherys, to get the math behind ML. Not necessarily to use AI/ML as a tech, which really can benefit from the grind towards commodification, but to be able to parse the technical side of about 90 to 95 percent of conference papers.

woadwarrior01 · yesterday at 3:51 PM

It's just a long-winded way of saying "tied embeddings"[1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models, and many other architectures use weight-tied input/output embeddings.

[1]: https://arxiv.org/abs/1608.05859
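
Concretely, the tie is a single line in, e.g., PyTorch: the output head's weight is literally the same tensor as the token embedding. Here's a minimal sketch (the class name and dimensions are made up for illustration, not taken from the article or from GPT-2's actual code):

    import torch
    import torch.nn as nn

    class TinyTiedLM(nn.Module):
        """Toy LM head demonstrating tied input/output embeddings."""

        def __init__(self, vocab_size: int = 50257, d_model: int = 768):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)            # (vocab, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab, d_model)
            # The tie: the output projection reuses the embedding matrix.
            self.lm_head.weight = self.tok_emb.weight

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            # `hidden` stands in for the context vectors from the last
            # Transformer block, shape (batch, seq_len, d_model).
            # Projecting into vocab space is just hidden @ W_emb.T,
            # yielding logits of shape (batch, seq_len, vocab_size).
            return self.lm_head(hidden)

    model = TinyTiedLM()
    assert model.lm_head.weight.data_ptr() == model.tok_emb.weight.data_ptr()
    logits = model(torch.randn(2, 8, 768))   # torch.Size([2, 8, 50257])

With a ~50k vocabulary and d_model of 768, that one shared matrix is roughly 38M parameters, a sizeable chunk of a GPT-2-small-sized model, which is part of why tying is so common in smaller architectures.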

jcims · yesterday at 2:22 PM

The turbo encabulator lives on.

whimsicalism · yesterday at 5:29 PM

I consider it a bit rude to make people read AI output without flagging it immediately.

QuadmasterXLII · today at 1:21 AM

The second paragraph is highly derivative of the adversarial turbo encabulator, which Schmithuber invented in the 90s. No citation of course.

miki123211 · yesterday at 4:59 PM

As somebody who understands how LLMs work pretty well, I can definitely feel your pain.

I started learning about neural networks when Whisper came out; at that point I literally knew nothing about how they worked. I started by reading the Whisper paper... which made about 0 sense to me. I was wondering whether all of those fancy terms were truly necessary. Now, I can't even imagine how I'd describe similar concepts without them.

empath75 · yesterday at 3:16 PM

It's a 28-part series. If you start from the beginning, everything is explained in detail.

squigz · yesterday at 6:03 PM

I'm glad I'm not the only one who has a Turbo Encabulator moment when this stuff is posted.

unethical_ban · yesterday at 5:01 PM

I was reading this thinking, "Holy crap, this stuff sounds straight out of Norman Rockwell... wait, Rockwell Automation. Oh, it actually is."

ekropotin · yesterday at 3:31 PM

I have no idea what you’ve just said, so here is my upvote.