logoalt Hacker News

4bpptoday at 2:43 AM2 repliesview on HN

Assuming the benchmarks are sound (rather than capturing a fluke), the provided explanation still does not pass the smell test. As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n, unless perhaps these layers were initialised as identity and the training process did not get to change them much. (Plausible for middle layers?)

Considering this, I think (again, assuming the benchmarks themselves are sound) the most plausible explanation for the observations is (1) the layers being duplicated are close to the identity function on most inputs; (2) something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance; (3) the mechanism causing the degradation involves the duplicated layers, so their duplication has the effect of breaking the reasoning-degrading mechanism (e.g. by clobbering a "refusal" "circuit" that emerged in post-training).

More concisely, I'm positing that this is an approach that can only ever break things, and rather than boosting reasoning, it is selectively breaking things deleterious to reasoning.


Replies

ACCount37today at 11:28 AM

Empirical findings tell a very different tale: all LLM layers use vaguely compatible internal representations. And middle layers in particular can be almost interchangeable - a lot of what they seems to be "iterative refinement of the same representations". Proven by various probes and ablations, but the most obvious one is probably the good old logit lens.

This is likely to be shaped by tied embeddings and skips on one end, and maybe training pressures on the other.

The very top of FF stack and the very bottom of FF stack both reflect the same token embeddings - and this propagates through the model, setting up a shared identity space. Skip connections propagate that through the layers. No explicit shared identity imposed, but there is an implicit one set by the architecture. Fairly well established.

(Now: highly speculative! Attention over past tokens creates an implicit "robustness/convergence" pressure? The model can't be "certain" if it'll have access to the right representations at a given layer, because representations depend not just on the past layers, but also on the highly uncertain contents of previous tokens as passed through attention. Which in turn depends on more of the same, increasing variance further. So the training causes: "each layer can't be certain of what it will have access to, so it develops to refine anything it currently has access to in a convergent fashion, because that's what's useful under pressure of attention-induced uncertainty".)

LLMs are notoriously nonfragile, and robust to perturbations. Far more so if you anneal with SFT/distillation after your model surgery, although this wasn't done here. Plenty of weird franken-LLM experiments prove that empirically.

So I'm not too surprised to find that someone has managed to improve benchmark performance on a few narrow tasks by duplicating a few middle layers. "Duplicating a few layers that were doing convergent iterative refinement benefits a few tasks that suffered from insufficient depth of convergent iterative refinement" is a fairly reasonable hypothesis, in my eyes.

The chances of duplication "breaking something somewhere" are high, and I would expect the capability profile of an unannealed franken-LLM like this to have a few gaps in it if evaluated extensively against the original. But "franken-LLM layer duplication can actually improve some things" is far too plausible with what we know to be dismissed pre-emptively.

show 1 reply
kirill5poltoday at 4:00 AM

Basically all of them are using residual connections so it’s not that surprising honestly