Hacker News

lavaman131 today at 5:16 AM

This is a really interesting benchmark, and a timely one, given that a lot of existing benchmarks don't do a good job. The mechanics and edge cases seem notoriously difficult to parse, perhaps even for human players. Have you also been plugging these into newer reasoning models to see how giving them thinking time improves their win rate against the baseline?


Replies

CallumFerg today at 5:50 AM

Since the library tools are just an MCP server, I did some testing in ChatGPT and Claude, where I don't have to pay for API credits.

With maximum thinking and web search to look up Magic rules, I never saw either one make a mistake. They are probably better at following the rules than the average Magic player (but not better at making the most strategic moves).

The benchmark was mostly to find out which is the cheapest model, at the lowest reasoning effort, that would still provide a good experience for the app. The answer turned out to be that, for now, there is no cost-effective way to run this app.

To provide a good experience, the simulations either need to be near instant, or you need to be able to run dozens or hundreds of simulations in parallel and do statistical analysis.
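
As a rough illustration of the "many simulations in parallel plus statistics" option, here is a minimal Python sketch. It is not the app's actual code: `simulate_game` is a hypothetical stand-in that flips a weighted coin instead of calling a model, and the 0.55 win probability, the batch size, and the worker count are made-up values. The statistics part is a standard 95% Wilson score interval on the win rate.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_game(seed: int) -> bool:
    """Hypothetical stand-in for one model-driven game; True means a win."""
    rng = random.Random(seed)
    return rng.random() < 0.55  # assumed win probability, illustration only


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def run_batch(n_games: int, max_workers: int = 16) -> tuple[int, int]:
    """Run n_games simulations concurrently and return (wins, games)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(simulate_game, range(n_games)))
    return sum(results), n_games


if __name__ == "__main__":
    wins, n = run_batch(200)
    low, high = wilson_interval(wins, n)
    print(f"win rate {wins / n:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

Threads are used here on the assumption that a real simulation spends most of its time waiting on network calls to a model. The interval also shows why dozens of games are not much: with 100 games and a 50% observed win rate, the 95% interval still spans roughly 40% to 60%.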