PokerBench is my attempt at a new LLM benchmark: frontier models play Texas Hold'em against each other in an arena setting. It also features a simulator for replaying individual games and observing how the different models reason about poker strategy. The current lineup covers Opus/Haiku, Gemini Pro/Flash, GPT-5.2/5 mini, and Grok 4.1 Fast Reasoning.
All code -> https://github.com/JoeAzar/pokerbench
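One of the fiddly parts of any LLM poker arena is mapping a model's free-text reply onto a legal action. Here is a simplified sketch of how that step might look; the function name, the grammar, and the treat-garbage-as-fold policy are all placeholders, not necessarily what the repo actually does:

    # Sketch: parse a model's free-text output into a legal poker action.
    # Names and grammar are placeholders, not the repo's real code.
    import re

    def parse_action(text: str, to_call: int, stack: int) -> tuple[str, int]:
        """Map output like 'raise 40', 'call', or 'fold' to a legal action.
        Unparseable or illegal output is treated as a fold (one possible policy)."""
        m = re.search(r"\b(fold|check|call|raise\s+(\d+))\b", text.lower())
        if not m:
            return ("fold", 0)
        verb = m.group(1).split()[0]
        if verb == "fold":
            return ("fold", 0)
        if verb == "check":
            return ("check", 0) if to_call == 0 else ("fold", 0)
        if verb == "call":
            return ("call", min(to_call, stack))
        amount = int(m.group(2))
        if amount <= to_call or amount > stack:  # must be a real, affordable raise
            return ("call", min(to_call, stack))
        return ("raise", amount)

    print(parse_action("I think I should raise 40 here.", to_call=10, stack=200))
    # -> ('raise', 40)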
Finally, a way to settle the model wars that actually matters: Texas Hold'em. That 3D replay view is sick! ♠♦ I spent way too long watching the replay on Game 2a58900d. It’s wild to see the chain of thought mapped against the betting rounds. It really exposes when a model is hallucinating a strong hand versus actually calculating pot odds. This 'PokerBench' might actually become the standard for measuring agentic risk-taking.
Fun! Any idea what the cost per game works out to? I'm also worried that 160 games isn't a big enough sample size.
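A back-of-envelope on the sample-size worry (the per-game standard deviation here is an assumed figure, not from the benchmark):

    import math

    n_games = 160
    sd_per_game = 50.0  # assumed profit SD in big blinds per game; poker variance is high
    se = sd_per_game / math.sqrt(n_games)
    print(f"standard error: {se:.1f} bb/game, 95% CI half-width: +/-{1.96 * se:.1f} bb/game")
    # -> standard error: 4.0 bb/game, 95% CI half-width: +/-7.7 bb/game

With numbers like that, two models would need a large true skill gap before 160 games could separate them reliably.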
Do you have any idea why the smaller models are doing better than the large ones?
Very, very fun. Just glancing at this quickly at lunch, but have you thought about incorporating tool use?
What about open-source models? I remember DeepSeek performing pretty well on the trading benchmarks.
Do you have any idea why GPT-5.2's win rate is higher than Gemini 3 Flash's, yet the former loses money while the latter makes money? Is it just bet sizing (betting more when it has a good hand) or something else?
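To make the bet-sizing hypothesis concrete with toy numbers (not the benchmark's actual figures): a model can win more hands yet lose money if it wins small pots and loses big ones.

    # Toy numbers only -- illustrating how win rate and profit can diverge.
    # Model A wins 60% of hands but wins small pots and loses big ones.
    ev_a = 0.60 * 10.0 - 0.40 * 20.0  # 60% win small (+10 bb), 40% lose big (-20 bb)
    # Model B wins only 45% of hands but sizes its bets well.
    ev_b = 0.45 * 25.0 - 0.55 * 12.0  # 45% win big (+25 bb), 55% lose small (-12 bb)
    print(f"A: {ev_a:+.2f} bb/hand, B: {ev_b:+.2f} bb/hand")
    # -> A: -2.00 bb/hand, B: +4.65 bb/hand (higher win rate, lower profit)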