This is incredibly fascinating.
I feel like one round of RL could potentially fix "short circuits" like these. It seems convinced that a particular rule isn't "allowed" when it's actually fine. Wouldn't that mean you just have to fine-tune it a bit more on its reasoning path?
I believe this comes from how people actually phrase these questions.
If I asked you, "Hey, how many Rs in strawberry?", you're going to tell me 2, because the likelihood is I'm asking about the double R at the end. That's at least how I'd interpret the question without the "LLM test" clouding my vision.
Same if I asked how many Ls are in gullible: I'd say "it's a double L after the U".
My guess is that this usage has muddled the training data.
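For what it's worth, the literal counts don't match those colloquial answers. A quick Python sketch using the two example words above:

```python
# Literal letter counts for the example words above.
# The colloquial answers ("2", "double L") undercount both.
for word, letter in [("strawberry", "r"), ("gullible", "l")]:
    total = word.count(letter)
    print(f"{word!r} has {total} occurrences of {letter!r}")
# 'strawberry' has 3 occurrences of 'r'
# 'gullible' has 3 occurrences of 'l'
```

So if training text is full of people answering with the double-letter cluster rather than the literal total, the mismatch would look a lot like what we see here.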