It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.
Is there a leaderboard out there comparing harness results using the same models?
I really wish there was! I even thought of creating one, but it would be a conflict of interest.
We probably want to compare the Cartesian product of model+harness.
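To make the Cartesian product idea concrete, here is a rough sketch of enumerating every model+harness pair that such a leaderboard would need to evaluate. The model and harness names are hypothetical placeholders, not real systems, and no scores are included since none are given here beyond the 48% and 65% figures mentioned above.

```python
from itertools import product

# Hypothetical placeholder names, for illustration only.
models = ["model-a", "model-b"]
harnesses = ["harness-x", "harness-y", "harness-z"]


def evaluation_grid(models, harnesses):
    """Return every (model, harness) pair, i.e. the Cartesian product."""
    return list(product(models, harnesses))


grid = evaluation_grid(models, harnesses)

# Each pair is one cell of the leaderboard that needs its own benchmark run.
for model, harness in grid:
    print(f"{model} + {harness}")
```

The grid grows as len(models) * len(harnesses), so the cost of filling in every cell is probably part of why comparisons like this are rare.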