I find the RYS result far more surprising than the ERD result. Encode-Reasoning-Decode is, after all, a very popular way to design neural networks (even an autoencoder is just that without the reasoning step), so the same structure emerging from optimization isn't that surprising.
But the methodology to measure it and put numbers on which layers are most involved in encoding/decoding and where the reasoning takes place is very valuable.
The finding that the phases are more cleanly separated in large-ish models is interesting. I wonder what this could mean for embedding models? Usually we take small LLMs and chop off the last couple of layers to get an embedding model. But I wonder if you could get better embedding models using something like the first five layers of Qwen3.5-27B, or the first X layers of Kimi K2.5? The methodology in the article seems to give a straightforward way to find the optimal cutting point.
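As a toy sketch of that cut-and-pool idea: keep only the first `cut` layers of a transformer and mean-pool the resulting hidden states into one embedding per sequence. Everything here is made up for illustration (a randomly initialized encoder, placeholder sizes, an arbitrary cut point), not the article's method or any pretrained model.

```python
import torch
import torch.nn as nn

class TruncatedEncoder(nn.Module):
    """Run only the first `cut` of `n_layers` transformer layers,
    then mean-pool token states into a single embedding vector."""

    def __init__(self, vocab=1000, d=64, n_layers=8, cut=3):
        super().__init__()
        self.cut = cut
        self.embed = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, ids):
        h = self.embed(ids)
        # "Chop off" the later layers: stop after the cut point.
        for layer in self.layers[: self.cut]:
            h = layer(h)
        # Mean-pool over the token dimension -> one vector per sequence.
        return h.mean(dim=1)

model = TruncatedEncoder()
ids = torch.randint(0, 1000, (2, 10))  # batch of 2 toy "sentences"
emb = model(ids)
print(emb.shape)  # torch.Size([2, 64])
```

Finding the optimal cut would then just mean sweeping `cut` over the candidate layers and scoring the resulting embeddings on a retrieval or similarity benchmark.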