Since the library tools are just an MCP server, I did some testing in ChatGPT and Claude, where I don't have to pay for API credits.
With maximum thinking and web search to look up Magic rules, I never saw it make a mistake. It is probably better at following the rules than the average Magic player (but not better at making the most strategic moves).
The benchmark was mostly to find out which is the cheapest model, at the lowest reasoning effort, that would still provide a good experience for the app. The answer turned out to be that, for now, there is no cost-effective way to run this app.
To provide a good experience, the simulations either need to be near instant, or you need to be able to run dozens or hundreds of simulations in parallel and do statistical analysis.
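
As a rough sketch of what the "many simulations in parallel plus statistics" route could look like, here is a minimal Python example. Everything in it is an assumption for illustration: `simulate_game` is a hypothetical stand-in (a seeded coin flip, not a real model call through the MCP tools), and the 55% win chance, worker count, and Wilson interval are arbitrary choices, not something the app does today.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_game(seed: int) -> bool:
    """Hypothetical stand-in for one model-driven simulation.

    A real run would call the model through the MCP tools; here a seeded
    coin flip with an assumed 55% win chance keeps the sketch runnable offline.
    """
    return random.Random(seed).random() < 0.55


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def run_batch(n_games: int, max_workers: int = 16) -> dict:
    """Run n_games simulations concurrently and summarise the results."""
    # Threads suit this shape of work because a real simulation would spend
    # most of its time waiting on network calls to the model.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(simulate_game, range(n_games)))
    wins = sum(results)
    return {
        "games": n_games,
        "wins": wins,
        "win_rate": wins / n_games if n_games else 0.0,
        "ci95": wilson_interval(wins, n_games),
    }


if __name__ == "__main__":
    print(run_batch(200))
```

The point of the interval is that a single simulated game says very little, while a batch of a few hundred gives a win rate with a usable error bar. The catch is that every one of those games is a full model run, which is where the cost adds up.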