This lines up with what I have seen doing CKA (centered kernel alignment) analysis on transformer internals. The middle layers in most large models have surprisingly similar representations to their neighbors, so duplicating them is basically giving the model extra compute cycles in a region where it is already doing useful refinement without messing up the input/output encoding stages. Curious whether picking layers by representation similarity instead of just a contiguous block would do even better.
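For anyone who wants to try the similarity-based selection idea, here is a rough sketch of linear CKA between two layers' activations (the function name and toy data are mine, not from any particular library):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape
    (n_samples, n_features). Returns a value in [0, 1];
    higher means more similar representations."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# toy example: a layer vs. a slightly perturbed copy vs. an unrelated layer
rng = np.random.default_rng(0)
A = rng.normal(size=(256, 64))
B = A + 0.1 * rng.normal(size=(256, 64))  # near-duplicate representation
C = rng.normal(size=(256, 64))            # unrelated representation

print(linear_cka(A, A))  # exactly 1 with itself
print(linear_cka(A, B))  # high: candidate block to duplicate
print(linear_cka(A, C))  # low: boundary between "organs"
```

You'd run this pairwise over the hidden states at each layer and look for contiguous high-CKA blocks; the idea would be to duplicate whole blocks rather than individual layers.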
Have a look at the boundaries in the heatmaps.
They are of course open to interpretation, but they suggest to me that these models develop 'organs' for processing different types of data, and that unless you duplicate the 'whole organ' you don't get the benefits.
This is quite different from the picture you usually get from layer ablation experiments. Thoughts?