Did you try repeating the same mid-layer block more than once?
If the gain comes from giving the model another pass over its internal representation, I'd expect some sort of diminishing-returns curve as you add more repeats. But if those layers form a specific circuit, running it multiple times might actually break the computation.
It would be really interesting to see which of those regimes the model falls into.
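For anyone who wants to try this: a minimal sketch of the idea with a toy PyTorch stack (hypothetical layer sizes and indices, not the actual model under discussion). It runs the middle pair of layers an extra `mid_repeats` times and compares against the single-pass output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 6-layer stack as a stand-in for the real model.
# dropout=0.0 so repeated forward passes are deterministic.
d_model, n_layers = 32, 6
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, dropout=0.0, batch_first=True)
    for _ in range(n_layers)
)

def forward(x, mid_repeats=1):
    # Re-run the "mid-layer block" (layers 2-3 here, chosen arbitrarily)
    # mid_repeats times; all other layers run once.
    for i, layer in enumerate(layers):
        reps = mid_repeats if i in (2, 3) else 1
        for _ in range(reps):
            x = layer(x)
    return x

x = torch.randn(1, 8, d_model)
base = forward(x, mid_repeats=1)
rep2 = forward(x, mid_repeats=2)
print(base.shape)  # shape is preserved; activations diverge with extra passes
```

Plotting some quality metric against `mid_repeats` would show which regime the model is in: smooth diminishing returns, or a cliff where the circuit breaks.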
Yes!
I tried that pretty early on, but it's basically never good. It's described in this section: https://dnhkng.github.io/posts/rys/#the-beginning-of-llm-neu...