
gertlabs, today at 1:12 AM

While this benchmark has interesting results, the "Contamination free" label only holds for the initial release of the benchmark. It still has the same fundamental design issue as any other benchmark: each task has a single correct answer. It also looks to be largely saturated upon release.

What they did well: normalizing the harness to mini-swe-agent. Models should be able to generalize to different tools at this point, and when they struggle to do that (like most Google models), they're unlikely to be useful in practice. That kind of generalization is an inherent part of intelligence.

For a benchmark that scales, you need to remove the ceiling: provide environments whose measurable goals are NOT a single correct answer, with evaluation criteria complex enough to scale well beyond the current frontier.

We do this by running multi-agent simulations with large action spaces at https://gertlabs.com/rankings.

We're still relatively unknown in the benchmarking space. But because we rotate the pool of environments and make sure the optimal strategy in each environment depends on the other agents participating, we expect to be able to resist contamination as major labs start investing more effort in climbing the leaderboard. We've already seen Chinese labs taking an interest.
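To make the "optimal strategy depends on the other agents" point concrete, here is a toy sketch in Python. It is not one of our actual environments. It uses the classic "guess 2/3 of the average" game as a stand-in, and the agent names, pools, and scoring function are all hypothetical. The idea it illustrates: an answer memorized against one pool of opponents stops being the best answer once the pool changes.

```python
from statistics import mean


def two_thirds_scores(guesses: dict[str, float]) -> dict[str, float]:
    """Score each agent by how close its guess is to 2/3 of the pool's mean.

    Scores are negative distances, so higher is better. The target is
    derived from every agent's guess, so no fixed answer is always right.
    """
    target = (2 / 3) * mean(guesses.values())
    return {name: -abs(guess - target) for name, guess in guesses.items()}


def best_agent(guesses: dict[str, float]) -> str:
    """Return the name of the agent with the highest score in this pool."""
    scores = two_thirds_scores(guesses)
    return max(scores, key=scores.get)


# Hypothetical pools: "memorizer" always plays the same guess,
# while the rivals it faces change between rounds.
pool_a = {"memorizer": 22.0, "rival_1": 50.0, "rival_2": 40.0}
pool_b = {"memorizer": 22.0, "rival_1": 10.0, "rival_2": 5.0}

print(best_agent(pool_a))  # the fixed guess happens to win against this pool
print(best_agent(pool_b))  # the same guess loses once the opponents change
```

The same guess of 22 wins against the first pool and loses against the second, which is the property a single-correct-answer task cannot have.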