You don't explain how scoring works, maybe it's obvious to MTG players? If you're using gpt 5.5, is there a possibility that it is biased in favour of models that think the way it does?
The scoring is just based on a simple prompt which is given the game state at the start and end of the turn and the log of tool calls and the final turn summary. The prompt asks it to evaluate the quality of the simulation from 0 to 10, and to give pass or fail for if it is legal.
It is far from ideal, but from my testing, even underpowered small LLMs that could not complete a single legal turn were reasonably good at judging if a simulation was legal. The final judging was all done by gpt-5.5 (medium) which might have given the OpenAI models an advantage, but from all the simulations I looked at, it seemed pretty fair.
This benchmark ended up be more of a test of how well an LLM can call tools without contradicting itself or backtracking. Most of the failures were not because of breaking magic rules, but because it could not sequence the tool calls correctly.
The failure mode seems to be that some models are overly trained to start tool calls, even when the model itself knows that it should not be calling the tool. Both of those examples were not errors because the judge prompt said they were illegal. In both of those examples the model stopped the simulation itself knowing that it made a tool error.
The Opus 4.8 examples are especially weird because it will consistently make the same tool call error 2 or 3 times in a row, and it will put things like "placeholder" or "noop" for the tool call reason.
The scoring is just based on a simple prompt which is given the game state at the start and end of the turn and the log of tool calls and the final turn summary. The prompt asks it to evaluate the quality of the simulation from 0 to 10, and to give pass or fail for if it is legal.
It is far from ideal, but from my testing, even underpowered small LLMs that could not complete a single legal turn were reasonably good at judging if a simulation was legal. The final judging was all done by gpt-5.5 (medium) which might have given the OpenAI models an advantage, but from all the simulations I looked at, it seemed pretty fair.
This benchmark ended up be more of a test of how well an LLM can call tools without contradicting itself or backtracking. Most of the failures were not because of breaking magic rules, but because it could not sequence the tool calls correctly.
For example: https://app.mtgautodeck.com/public/benchmarks/6349dda2-4069-...
and: https://app.mtgautodeck.com/public/benchmarks/dcc18bd8-339d-...
The failure mode seems to be that some models are overly trained to start tool calls, even when the model itself knows that it should not be calling the tool. Both of those examples were not errors because the judge prompt said they were illegal. In both of those examples the model stopped the simulation itself knowing that it made a tool error.
The Opus 4.8 examples are especially weird because it will consistently make the same tool call error 2 or 3 times in a row, and it will put things like "placeholder" or "noop" for the tool call reason.