It's funny because this simple exercise shows all the problems I have with reasoning models: they produce a long reasoning trace that takes too much time to verify and still can't be trusted.
Strawberry is "difficult" not because the reasoning is difficult, but because tokenization doesn't let the model reason at the level of characters. That's why it has to work so hard and doesn't trust its own conclusions.
I may be looking at this too deeply, but I think this suggests that the reasoning is not always utilized when forming the final reply.
For example, IMMEDIATELY, in its first section of reasoning, where it starts counting the letters:
> R – wait, is there another one? Let me check again. After the first R, it goes A, W, B, E, then R again, and then Y. Oh, so after E comes R, making that the second 'R', and then another R before Y? Wait, no, let me count correctly.
1. During its counting process, it repeatedly finds 3 "r"s (at positions 3, 8, and 9)
2. However, its intrinsic knowledge that "strawberry" has "two Rs" keeps overriding this direct evidence
3. This suggests there's an inherent weight given to the LLM's intrinsic knowledge that takes precedence over what it discovers through step-by-step reasoning
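The count the model keeps second-guessing is trivial to verify once you actually have characters, which is exactly what tokenization denies it. A minimal Python check (using 1-indexed positions, to match the list above):

```python
word = "strawberry"

# 1-indexed positions of every 'r' in the word
positions = [i + 1 for i, ch in enumerate(word) if ch == "r"]

print(positions)       # -> [3, 8, 9]
print(len(positions))  # -> 3
```

Three lines of character-level work settle what the model spent paragraphs of reasoning on and still doubted.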
To me, that points to an inherent weight (unintended pun) given to the model's "intrinsic" knowledge, as opposed to what it uncovers during the reasoning itself.