ChatGPT, Claude, Grok, and Google AI Overviews (whatever model powers the latter) were all used in one or more of these examples, in various configurations. They can perform differently, and I often try more than one when the first attempt doesn't work well. I don't think there's any fundamental difference in the principle of their operation, and I don't think there will be until the next major breakthrough.
My hypothesis is that a model fails to switch into a deep-thinking mode (if it has one) and blurts out whatever it absorbed from internet data during autoregressive training. I tested this with the alpha-blending example: Gemini 2.5 Flash fails, Gemini 2.5 Pro succeeds.
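The thread doesn't spell out the exact alpha-blending prompt, but for reference, the standard "over" compositing formula the models would be expected to know is out = alpha * src + (1 - alpha) * dst, per channel. A minimal sketch (my own illustration, not from the thread):

```python
def alpha_blend(src, dst, alpha):
    """Standard 'over' compositing: out = alpha*src + (1-alpha)*dst per channel."""
    return tuple(alpha * s + (1 - alpha) * d for s, d in zip(src, dst))

# 60%-opaque red composited over white:
print(alpha_blend((255, 0, 0), (255, 255, 255), 0.6))  # (255.0, 102.0, 102.0)
```

A model that has genuinely "switched on" reasoning should work through this arithmetic rather than pattern-match a plausible-looking color.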
How does the presence or absence of a world model blend into all this? I guess "having a consistent world model at all times" is an incorrect description of humans, too. We seem to have one because we have mechanisms to notice errors, correct them, remember the results, and use those results when similar situations arise, while slowly updating our intuitions about the world to incorporate changes.
The current models lack the "remember/use/update" parts.
> I don't think there's any fundamental difference in the principle of their operation
Yeah, they seem to be subject to the universal approximation theorem (this needs to be checked more thoroughly, but I think we can build a transformer that is equivalent to any given fully connected multilayer network).
That is, at a certain size they can do anything a human can do at a certain point in their life (that is, with no additional training), regardless of whether humans have world models and what those models are at the neuronal level.
But there are additional nuances related to their architectures and training regimes, and practical questions of the required size.
Each of these models has a thinking/reasoning variant and a default non-thinking variant. I would expect the reasoning variants (o3 or "GPT-5 Thinking", Gemini DeepThink, Claude with Extended Thinking, etc.) to do better at this. I also think there is some chance that their reasoning traces display something you might see as closer to world modelling; in particular, you might find them explicitly tracking the positions of pieces and checking move validity.