Heh, for the last couple of days I've been doing this exact kind of "neuroanatomy" on Qwen2.5/Qwen3 too. Fascinating stuff. To make it easier to fiddle with the network, I wrote a small inference engine stripped of all the framework magic, just raw matmuls (the main inference loop is only ~50 lines of code!). For example, removing a layer is trivial: I just skip it in the code with a simple "if". I've found that removing some layers doesn't appear to change anything (going by vibes, at least). Remove some of the later layers and the model forgets how to emit the EOS token and keeps chatting ad infinitum (still coherently); remove the earliest layers and it generates random garbage.

It turns out abliteration isn't hard to do either: 10 examples were enough to find the refusal vector and cancel most refusals. Interestingly, I found that refusal shows up in the middle layers too (layer 12 out of 26, I think).
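Both tricks really are only a few lines. Here's a minimal numpy sketch of the idea, with toy random weights standing in for the real blocks (all names are mine, nothing Qwen-specific): skip a layer with an `if`, and ablate a refusal direction found as the difference of mean activations.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS = 64, 8

# Toy stand-ins for transformer blocks: each block is a small random
# linear map (the real thing would be attention + MLP matmuls).
blocks = [rng.normal(scale=0.1 / np.sqrt(D), size=(D, D)) for _ in range(N_LAYERS)]

def forward(x, skip=(), ablate_dir=None):
    """Minimal residual-stream loop. `skip` removes layers; `ablate_dir`
    projects a (refusal) direction out of the stream after every block."""
    for i, w in enumerate(blocks):
        if i in skip:              # "remove a layer" is literally one `if`
            continue
        x = x + x @ w              # residual connection
        if ablate_dir is not None:
            x = x - (x @ ablate_dir) * ablate_dir
    return x

# Finding the refusal direction: difference of mean activations between
# refused and complied prompts (faked here with a synthetic offset).
refuse_acts = rng.normal(size=(10, D)) + 4.0 * np.eye(D)[0]
comply_acts = rng.normal(size=(10, D))
r = refuse_acts.mean(0) - comply_acts.mean(0)
r = r / np.linalg.norm(r)

x = rng.normal(size=D)
out = forward(x, skip={3}, ablate_dir=r)
print(abs(out @ r))  # ~0: no refusal component survives the last projection
```

The mean-difference direction is the standard abliteration recipe as I understand it; the only real-model extra work is caching activations at the right layer.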
From what I understand, transformers tolerate this kind of corruption (without complete collapse) thanks to residual connections: skipping a block just leaves the residual stream unchanged at that step, so later layers still get a valid input.
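A quick toy experiment matches that intuition: drop one layer from a residual stack of random linear maps versus a plain (non-residual) stack, and compare outputs by cosine similarity. Again, random toy weights, not the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 64, 8
# Small maps for the residual stack, unit-scale maps for the plain stack.
small = [rng.normal(scale=0.1 / np.sqrt(D), size=(D, D)) for _ in range(N)]
big   = [rng.normal(scale=1.0 / np.sqrt(D), size=(D, D)) for _ in range(N)]

def run(x, mats, residual, skip=()):
    for i, w in enumerate(mats):
        if i in skip:
            continue
        x = x + x @ w if residual else x @ w
    return x

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = rng.normal(size=D)
res_sim   = cos(run(x, small, True),  run(x, small, True,  skip={4}))
plain_sim = cos(run(x, big,   False), run(x, big,   False, skip={4}))
print(res_sim, plain_sim)  # residual stays close to 1; plain wanders off
```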
I tried repeating some layers too, but got garbage results. I guess I need to automate finding the reasoning layers as well, instead of just guessing.
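For what it's worth, repeating a layer is the same one-line change as skipping one if the loop runs over an explicit layer order. A sketch with toy weights (`order` and `blocks` are my names, not anything from a real engine):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64
blocks = [rng.normal(scale=0.1 / np.sqrt(D), size=(D, D)) for _ in range(4)]

def forward(x, order):
    # `order` lists which block to run at each step: duplicates repeat a
    # layer, omissions skip it, so both experiments share one code path.
    for i in order:
        x = x + x @ blocks[i]
    return x

x = rng.normal(size=D)
normal   = forward(x, [0, 1, 2, 3])
repeated = forward(x, [0, 1, 1, 2, 3])   # run layer 1 twice
skipped  = forward(x, [0, 2, 3])         # drop layer 1
```

Searching over `order` (instead of hand-picking) would also be a natural way to automate finding which layers matter.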
Hook it up to autoresearch?