>but it didn't perform well in our coding and reasoning testing
>Comprehensive evaluation results at https://gertlabs.com/rankings
But if you go to the linked site, it seems like the only thing being evaluated is how well the models play various games? I suppose that counts as "reasoning", but I don't see how coding ability is tested?
"Games" is loosely defined here, as we run the bench across hundreds of unique environments. For some, the models write code to play a game, either one-shot or via a harness where they can iterate and use tools. Others they play directly, making a decision on each game tick. Some are real-time, with a harness where the models can write code handlers or submit decisions to interact with the environment directly.
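Roughly, the tick-based variant boils down to a loop like the sketch below (heavily simplified; the interface names here are illustrative, not our real API): each tick the model sees the serialized state and submits a decision.

```python
from typing import Callable, Protocol, Tuple

class Env(Protocol):
    """Illustrative environment interface (simplified, not the actual bench API)."""
    def reset(self) -> str: ...                        # initial serialized observation
    def step(self, action: str) -> Tuple[str, float, bool]: ...  # obs, reward, done

def run_episode(env: Env, decide: Callable[[str], str], max_ticks: int = 1000) -> float:
    """Run one episode, asking the model for a decision on every tick."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_ticks):
        action = decide(obs)              # e.g. an LLM call mapping state -> move
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total                          # the episode score is the raw metric
```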
Coding is what we test most heavily. Testing it via a game format (instead of grading answers correct/incorrect) lets us score code objectively, scale to smarter models, and compare performance directly across models. When we built the first iteration last year, I was surprised by how well it mapped to my subjective experience of using models for coding. Games really are great for measuring intelligence.
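To be concrete about "objectively": every playthrough ends in a numeric score, so cross-model comparison reduces to aggregating those scores. A minimal sketch, simplified to a plain mean for illustration:

```python
from statistics import mean
from typing import Dict, List, Tuple

def rank_models(scores: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Rank models by mean episode score across environments, highest first."""
    return sorted(((m, mean(s)) for m, s in scores.items()),
                  key=lambda pair: pair[1], reverse=True)

# e.g. rank_models({"model-a": [12.0, 9.5], "model-b": [14.1, 13.0]})
# -> [("model-b", 13.55), ("model-a", 10.75)]
```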