Fascinating! I wonder if new training techniques could emerge from this. If we say layer 1 = translator, layers 2-5 = reasoner, layer 6 = re-translator, could we train small 6-layer models but evaluate them in a 1 > n*(2-5) > 6 setup, so the loss directly trains towards optimal middle layers that can be looped? You'd only have to train 6 layers but get the duplication benefit of the middle layers for free.
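Something like the sketch below is what I'm picturing, as a rough PyTorch illustration. Everything in it (the class name `LoopedMiddleStack`, the dimensions, using `nn.TransformerEncoderLayer` for each layer) is my own assumption, not anything established: the point is just that one shared middle block gets re-applied n times, so you could even sample `n_loops` per batch during training to make the loss reward loopability.

```python
import torch
import torch.nn as nn

class LoopedMiddleStack(nn.Module):
    """Toy 6-layer stack: layer 1 'translates' into a latent space,
    layers 2-5 'reason', layer 6 're-translates'. The middle block
    can be looped n times at evaluation (or training) time."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()

        def make_layer():
            return nn.TransformerEncoderLayer(d_model, n_heads,
                                              batch_first=True)

        self.translator = make_layer()                              # layer 1
        self.reasoner = nn.ModuleList(make_layer() for _ in range(4))  # 2-5
        self.retranslator = make_layer()                            # layer 6

    def forward(self, x, n_loops=1):
        # n_loops=1 is the plain 6-layer model; n_loops>1 re-applies
        # the same shared middle weights, i.e. 1 > n*(2-5) > 6.
        x = self.translator(x)
        for _ in range(n_loops):
            for layer in self.reasoner:
                x = layer(x)
        return self.retranslator(x)

model = LoopedMiddleStack()
tokens = torch.randn(1, 16, 256)    # (batch, seq, d_model)
shallow = model(tokens, n_loops=1)  # the 6 layers as trained
deep = model(tokens, n_loops=3)     # 1 > 3*(2-5) > 6 at eval time
```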