I doubt there is enough data here for this to be statistically significant (it oscillates too much to show real trends or step changes), but if this measure were hardened up a little, it would be really useful.
It feels like an analogue to an employee’s performance over time - you could see in the graphs when Claude is “sick” or “hungover”, when Claude picks up a new side hustle and starts completely phoning it in, or when it’s gunning for a promotion and trying extra hard (significant parameter changes). Pretty neat.
Obviously the anthropomorphising isn’t literal, but it is cool to think of the model’s performance as a fluid thing you have to work with, and one that can be measured like this.
I’m sure most people would prefer that the model’s performance stayed fixed over time. But come on, this is way more fun.