emp17344 • yesterday at 10:22 PM • 3 replies

That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.

Replies

famouswaffles • yesterday at 10:52 PM

Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to be regressed. Problem is people are more interested in validating their own assumptions than anything else.

https://arxiv.org/abs/2403.15498

https://arxiv.org/abs/2501.17186

https://github.com/adamkarvonen/chess_gpt_eval

dwohnitmok • today at 1:16 AM

> That’s a devastating benchmark design flaw

I think the parent simply missed, until their later reply, that the benchmark includes rated engines.

runarberg • yesterday at 10:50 PM

I like this game between grok-4.1-fast and maia-1100 (engine, not LLM).

https://chessbenchllm.onrender.com/game/37d0d260-d63b-4e41-9...

This exact game has been played 60 thousand times on lichess. The piece sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.

This reminds me of Stefan Zweig’s The Royal Game, where the protagonist survived Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book, btw; I am aware I just invoked Godwin’s law, and of the irony). The protagonist became “good” at chess simply by memorizing a lot of games.
