Empirical findings tell a very different tale: all LLM layers use vaguely compatible internal repres...

ACCount37 • today at 11:28 AM • 1 reply • view on HN

Empirical findings tell a very different tale: all LLM layers use vaguely compatible internal representations. And middle layers in particular can be almost interchangeable - a lot of what they seems to be "iterative refinement of the same representations". Proven by various probes and ablations, but the most obvious one is probably the good old logit lens.

This is likely to be shaped by tied embeddings and skips on one end, and maybe training pressures on the other.

The very top of FF stack and the very bottom of FF stack both reflect the same token embeddings - and this propagates through the model, setting up a shared identity space. Skip connections propagate that through the layers. No explicit shared identity imposed, but there is an implicit one set by the architecture. Fairly well established.

(Now: highly speculative! Attention over past tokens creates an implicit "robustness/convergence" pressure? The model can't be "certain" if it'll have access to the right representations at a given layer, because representations depend not just on the past layers, but also on the highly uncertain contents of previous tokens as passed through attention. Which in turn depends on more of the same, increasing variance further. So the training causes: "each layer can't be certain of what it will have access to, so it develops to refine anything it currently has access to in a convergent fashion, because that's what's useful under pressure of attention-induced uncertainty".)

LLMs are notoriously nonfragile, and robust to perturbations. Far more so if you anneal with SFT/distillation after your model surgery, although this wasn't done here. Plenty of weird franken-LLM experiments prove that empirically.

So I'm not too surprised to find that someone has managed to improve benchmark performance on a few narrow tasks by duplicating a few middle layers. "Duplicating a few layers that were doing convergent iterative refinement benefits a few tasks that suffered from insufficient depth of convergent iterative refinement" is a fairly reasonable hypothesis, in my eyes.

The chances of duplication "breaking something somewhere" are high, and I would expect the capability profile of an unannealed franken-LLM like this to have a few gaps in it if evaluated extensively against the original. But "franken-LLM layer duplication can actually improve some things" is far too plausible with what we know to be dismissed pre-emptively.

Replies

4bpp • today at 12:31 PM

That's interesting, could you point me to some source on these findings?

It seems to me that the difference between "iterative improvement" as you put it and "close to the identity" (as in the output is close to the input for most of the volume of the input space) as I put it is fairly subtle, anyway. One experiment I would like to see is what happens to the reasoning performance if rather than duplicating the selected layers, they are deleted/skipped entirely. If the layers improve reasoning by iterative improvement, this should make the performance worse; but if they contain a mechanism that degrades reasoning and is not robust against unannealed self-composition, it should make the performance similarly better.

alt Hacker News

Replies