This makes me think LLMs would be interesting to set up in a game of Diplomacy, which is an entirely text-based game which soft rather than hard requires a degree of backstabbing to win.
The findings in this game that the "thinking" model never did thinking seems odd, does the model not always show it's thinking steps? It seems bizarre that it wouldn't once reach for that tool when it must be being bombarded with seemingly contradictory information from other players.
It’s been done before
https://every.to/diplomacy (June 2025)
Reading more I'm a little disappointed that the write-up has seemingly leant so heavily on LLMs too, because it detracts credibility from the study itself.
https://noambrown.github.io/papers/22-Science-Diplomacy-TR.p...