And here lies the exact issue: single tests don't provide any meaningful insight. You need to run the test at least twenty times, in separate chat windows or via the API, to get reliable statistics.
For the "Alice in Wonderland" paper, neither Claude-3.5 nor o1-preview was available at that time.
But I have tested them as well a few weeks ago with the issue translated into German, achieving also a 100% success rate with both models.
However, when I add irrelevant information (My mother ...), Claude's success rate drops to 85%:
"My mother has a sister called Alice. Alice has 2 sisters and 1 brother. How many sisters does Alice's brother have?"
We do have Chatbot Arena, which to a degree already does this.
I like to use:
"Kim's mother is Linda. Linda's son is Rachel. John is Kim's daughter. Who is Kim's son?"
Interestingly, I just got a model called "engine test" that nailed this one in a three-sentence response, whereas o1-preview got it wrong (though it has gotten it right in the past).
You also need a problem that hasn't been copy-pasted a million times on the internet.
Your experience makes me think the models' improved success rate is not because they are better at reasoning, but because the problem made it into their training data.