Reading these comments is so harrowing.
You are correct in my intentions on this post generally.
I want to highlight:
I want to measure performance of the LLMs over time- which includes assessing the quality of their outputs. I don’t perceive the reasoning output to be anything other than a measurable signal of possible drift in model performance.
Except it isn’t, because I’m only getting a low value summary of the thinking.
It’s like asking your buddy how fast he thought that last pitch was when radar guns are behind the plate.
Yeah, it’s a description related to what happened, but it’s not the thing I want to measure.
I think the reality is at this point the frontier regards CoT as extremely valuable, none of them are giving you genuine CoT anymore. I don't think there is any future in attempting to measure or evaluate CoT from frontier models - I expect this to be a permanent shift.