David,
Thanks for this research. I remember being stunned when Goliath showed up and... worked; this feels like underexplored research right now.
I've been thinking about the implications of this for local generation. What's really nice about a repeated layer is that it takes up no extra memory, which makes it a great fit for the edge.
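To make the "no extra memory" point concrete, here's a toy sketch (hypothetical names, nothing to do with the actual RYS code): repeating a layer only changes the execution schedule, not the set of weight objects in memory, because the repeated entries point at the same layer.

```python
# Toy illustration: duplicating a layer in the forward schedule
# reuses the same weight object, so weight memory is unchanged.

class Layer:
    def __init__(self, w):
        self.w = w  # stand-in for a big weight matrix

    def forward(self, x):
        return x * self.w + 1  # toy transformation

layers = [Layer(2.0), Layer(3.0)]

# Repeat layer 1 in the *execution order* only -- no copy of its weights.
schedule = [layers[0], layers[1], layers[1]]

x = 1.0
for layer in schedule:
    x = layer.forward(x)

# Both schedule entries for layer 1 are the very same object.
assert schedule[1] is schedule[2]
print(x)  # -> 31.0
```

The depth of the forward pass grows, but the resident weight count does not, which is exactly why this is attractive on memory-constrained edge devices.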
Can you suggest some exploration angles on the edge side? I've recently started looking at fixing the expert layers for an entire generation run, which seems interesting - basically you pay the memory cost of loading the selected experts once - and I think RYS-type thinking is a natural extension of this. If you've got some ideas, I'm all ears.
Thanks!
I have pushed basic code to GitHub (https://github.com/dnhkng/RYS)
Some interesting areas to explore might be a combination of deleting some layers and duplicating others. i.e. reduce VRAM by dropping some layers (this works and is well documented), and recover performance by duplicating others (which costs no extra VRAM, since the weights are shared). I am not pursuing this, but it seems interesting!
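A rough sketch of that drop-and-duplicate idea (all numbers and names here are made up for illustration): prune some layers from the schedule so their weights can be freed, then repeat some surviving layers to recover depth at zero additional memory cost.

```python
# Hypothetical drop-and-duplicate schedule builder for a 32-layer model.

n_layers = 32
drop = {10, 11, 12, 13}   # layers whose weights we delete entirely -> less VRAM
repeat = {20, 21}         # surviving layers we run twice -> no extra VRAM

schedule = []
for i in range(n_layers):
    if i in drop:
        continue              # weights freed, layer never runs
    schedule.append(i)
    if i in repeat:
        schedule.append(i)    # same weights executed again

unique_weights = len(set(schedule))  # weight sets resident: 32 - 4 = 28
effective_depth = len(schedule)      # forward steps: 28 + 2 repeats = 30
print(unique_weights, effective_depth)
```

Which layers to drop and which to repeat is the open question; the sketch just shows that the two knobs (memory and depth) move independently.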
Ever since I read about this, I have been thinking about the next logical step: train a small NN to route the internal loops dynamically after each layer. Instead of fixing a given set of layers to repeat, let this new classifier decide whether to loop at all, where to loop, whether to loop multiple times, whether to loop over a larger block, or to jump straight to the final layers. Each token could then loop more or less depending on how hard it is.
It has some similarities to an MoE architecture, but instead of choosing experts, it chooses layer routes. Training this NN classifier together with the LLM could, if it works, drastically condense the number of layers required for a given level of capability. If anyone wants to work on this, feel free to send me a message.
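To pin down the control flow being proposed, here's a toy sketch (every name is hypothetical): after each layer a tiny router picks CONTINUE, LOOP (re-run the current layer), or EXIT (jump to the final layers). The router here is a hard-coded stub standing in for the small NN that would be trained jointly with the LLM.

```python
# Toy dynamic layer-routing loop. The "router" is a stub policy;
# a real version would be a learned classifier over the hidden state.

CONTINUE, LOOP, EXIT = range(3)

def router(state, layer_idx, loops_done, max_loops=2):
    # Stub: re-run layer 2 until a small loop budget is spent.
    if layer_idx == 2 and loops_done < max_loops:
        return LOOP
    return CONTINUE

def forward(x, n_layers=6):
    trace = []       # which layer ran at each step
    loops = {}       # per-layer loop counts
    i = 0
    while i < n_layers:
        x = x + 1                  # stand-in for layer i's computation
        trace.append(i)
        action = router(x, i, loops.get(i, 0))
        if action == LOOP:
            loops[i] = loops.get(i, 0) + 1   # stay on the same layer
        elif action == EXIT:
            break                            # skip ahead to the end
        else:
            i += 1
    return x, trace

x, trace = forward(0)
print(trace)  # layer 2 runs three times: [0, 1, 2, 2, 2, 3, 4, 5]
```

Per-token compute then varies with the router's decisions, much like a MoE varies which weights run per token, except here the choice is "how many times and which layers" rather than "which experts".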