logoalt Hacker News

dnhkngtoday at 4:10 PM1 replyview on HN

Have a look at the boundaries in the heatmaps.

They are of course open to interpretation, but it suggest to me that the models develop 'organs' for processing different types of data, and without duplicating the 'whole organ' you don't get the benefits.

This is quite different to what you usually see, which is via layer ablation experiments. Thoughts?


Replies

doctorpanglosstoday at 4:59 PM

Maybe you are observing artifacts of Qwen's training procedure. Perhaps they initialized further layers with the weights of previous ones as part of the training curriculum. But it's fun to imagine something more exotic.

show 1 reply