I believe they’re just classifying all models into “reasoning models” eg o3 vs “non reasoning models” eg 4o and just doing a comparison of total tokens (input tokens + hidden reasoning output tokens + shown output tokens)
that's exactly right!
that's exactly right!