The point of this post isn’t that the “reasoning” phase of LLM thinking isn’t the same as what humans consider reasoning; it’s that Anthropic is intentionally hiding Claude’s “reasoning output” to make the model harder to distill.
You are correct in my intentions on this post generally.
I want to highlight:
I want to measure performance of the LLMs over time- which includes assessing the quality of their outputs. I don’t perceive the reasoning output to be anything other than a measurable signal of possible drift in model performance.
Except it isn’t, because I’m only getting a low value summary of the thinking.
It’s like asking your buddy how fast he thought that last pitch was when radar guns are behind the plate.
Yeah, it’s a description related to what happened, but it’s not the thing I want to measure.
Reading these comments is so harrowing.
You are correct in my intentions on this post generally.
I want to highlight:
I want to measure performance of the LLMs over time- which includes assessing the quality of their outputs. I don’t perceive the reasoning output to be anything other than a measurable signal of possible drift in model performance.
Except it isn’t, because I’m only getting a low value summary of the thinking.
It’s like asking your buddy how fast he thought that last pitch was when radar guns are behind the plate.
Yeah, it’s a description related to what happened, but it’s not the thing I want to measure.