0o_MrPatrick_o0 today at 4:47 PM

Reading these comments is so harrowing.

You are correct about my intentions with this post, generally.

I want to highlight:

I want to measure the performance of LLMs over time, which includes assessing the quality of their outputs. I don’t perceive the reasoning output to be anything other than a measurable signal of possible drift in model performance.

Except it isn’t, because I’m only getting a low-value summary of the thinking.

It’s like asking your buddy how fast he thought that last pitch was when there are radar guns behind the plate.

Yeah, it’s a description related to what happened, but it’s not the thing I want to measure.


Replies

Catloafdev today at 4:53 PM

I think the reality is that, at this point, the frontier labs regard CoT as extremely valuable, and none of them are giving you genuine CoT anymore. I don't think there is any future in attempting to measure or evaluate CoT from frontier models. I expect this to be a permanent shift.