This was a little dense for me to grok. Are these well-known results, or is there an abstract-like summary?
The RYS ("repeat yourself") hypothesis is that duplicating (the right) layers is enough to improve performance (sorry if I didn't read closely enough: is it really just stacking the relevant layers?).
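If it really is just stacking, the surgery itself is simple. Here's a toy sketch of what I assume the duplication looks like; the "layers" are made-up stand-ins, not real transformer blocks (a real version would splice entries of something like `model.model.layers` in a transformers checkpoint):

```python
# Toy sketch of RYS-style layer duplication (my reading of the post, not
# its actual code): build a stack of layers, then splice duplicates of
# some middle layers back in with no retraining.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size of the toy model

def make_layer():
    # A stand-in "layer": residual connection plus a fixed random linear map.
    W = rng.normal(scale=0.1, size=(d, d))
    return lambda h: h + h @ W

layers = [make_layer() for _ in range(6)]  # toy 6-layer model

def forward(layer_stack, h):
    for layer in layer_stack:
        h = layer(h)
    return h

# The surgery: duplicate the hypothetical "reasoning" layers (here, layers
# 2-3), so the stack becomes 0 1 2 3 2 3 4 5.
duplicated = layers[:4] + layers[2:4] + layers[4:]

h0 = rng.normal(size=d)
print(len(layers), len(duplicated))  # 6 8
out_orig, out_dup = forward(layers, h0), forward(duplicated, h0)
```

Which middle layers to repeat is exactly what the post's methodology is meant to tell you; the indices above are arbitrary.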
And the ERD (encoding, reasoning, decoding) layer structure is a relatively robust observation? That the middle layers of the NN reason in a universal space, which is evidenced by the cosine similarities of the hidden states at each layer for similar vs. dissimilar inputs. And that similar inputs converge by around layer 5, and you can kinda watch that happen in the cosine similarities?
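For anyone who hasn't read the post, the measurement itself is easy to sketch. This toy version only shows the mechanics (per-layer cosine similarity between the hidden states of two inputs), not the actual convergence behavior; for that you'd need real LLM hidden states (e.g. `output_hidden_states=True` in transformers):

```python
# Toy sketch of the per-layer cosine-similarity measurement: run two
# inputs through a layer stack, record the hidden state after every
# layer, and compare the two trajectories layer by layer. The layers
# and inputs are random stand-ins, not a real model.
import numpy as np

rng = np.random.default_rng(1)
d = 16

def make_layer():
    W = rng.normal(scale=0.05, size=(d, d))
    return lambda h: h + np.tanh(h @ W)  # residual block stand-in

layers = [make_layer() for _ in range(8)]

def hidden_states(h):
    states = [h]
    for layer in layers:
        h = layer(h)
        states.append(h)
    return states

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two "inputs": b is a slightly perturbed copy of a (a similar pair).
a = rng.normal(size=d)
b = a + 0.1 * rng.normal(size=d)

sims = [cosine(ha, hb) for ha, hb in zip(hidden_states(a), hidden_states(b))]
print([round(s, 3) for s in sims])  # one similarity per layer boundary
```

In the real measurement you'd plot these curves for many similar and dissimilar pairs and look for where the similar pairs' curves converge.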
This post is incredible and I'm afraid it'll drop off the front page before people engage deeply with it. (The methodology was interesting; maybe there are other big ideas I'm missing.)
Perhaps not widely known, but certainly known in LLM research. There were a bunch of these experiments done 2 years ago, and what's interesting is that it still seems to work on the latest models.
Though beware that the increased scores on math and EQ could lead to other areas scoring less well; I'd love to see how these models score across all the open benchmarks.
I find the RYS result far more surprising than the ERD result. Encode-reason-decode is, after all, a very popular way to design neural networks (even an autoencoder is just that without the reasoning step), so the same structure emerging from optimization isn't that surprising.
But the methodology for measuring it, and for putting numbers on which layers are most involved in encoding/decoding and where the reasoning takes place, is very valuable.
The finding that the phases are more cleanly separated in large-ish models is interesting. I wonder what this could mean for embedding models? Usually we take small LLMs and chop off the last couple of layers to get an embedding model. But I wonder if you could get better embedding models using something like the first five layers of Qwen3.5-27B, or the first X layers of Kimi K2.5? The methodology in the article seems to give a straightforward way to find the optimal cutting point.
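The cut itself is cheap to sketch. This is a toy stand-in (random layers, not a real checkpoint) just to show the truncate-then-pool step; the cut point `k=5` is the hypothetical "first five layers" from my comment, not something the article prescribes:

```python
# Toy sketch of "chop the model at layer k to get an embedding model".
# A real version would load a checkpoint, run it with
# output_hidden_states=True, and mean-pool hidden_states[k] over tokens;
# this just illustrates truncation + pooling on stand-in layers.
import numpy as np

rng = np.random.default_rng(2)
d, n_tokens, n_layers = 16, 5, 12

def make_layer():
    W = rng.normal(scale=0.05, size=(d, d))
    return lambda h: h + np.tanh(h @ W)  # toy residual block

layers = [make_layer() for _ in range(n_layers)]

def embed(tokens, k):
    """Run only the first k layers, then mean-pool over the token axis."""
    h = tokens
    for layer in layers[:k]:
        h = layer(h)
    return h.mean(axis=0)

tokens = rng.normal(size=(n_tokens, d))  # stand-in for token embeddings
vec = embed(tokens, k=5)
print(vec.shape)  # (16,)
```

The interesting part would be choosing k: presumably where the article's per-layer similarity curves say encoding has finished and reasoning begins.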