I explored that, again with Devstral, but the execution with 4 times the same circuit lead to less score on the tests.
I chat with the model to see if the thing was still working and seemed coherent to me, I didn't notice anything off.
I need to automate testing like that, where you pick the local maxima and then iterate over that picking layers to see if it's actually better, and then leave the thing running overnight
Can Karpathy's autoresearch be used on this to explore what works and what does not? That is supposed to automate research like this from what I understand.