Hacker News

esafak today at 3:12 PM, 4 replies

Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?


Replies

NitpickLawyer today at 4:58 PM

SWE-rebench should not be gameable: they collect new issues from live repos, so checking scores 1-2 months after a model's release gives you an idea of real, uncontaminated performance. But even that would be "benchmaxxable". That's an overloaded term that can mean many things, but the most vanilla interpretation is that with RL you can get a model to follow a certain task type pretty well, at the cost of it getting "stuck" on that task type, or "stubborn" when asked to do similar but sufficiently different tasks. For SWE-rebench that would be: "it fixes bugs in these types of repos, under this harness, but ask it to do something else in a repo and you might not get the same results." In a nutshell.

underlines today at 3:22 PM

Well: your own unleaked ones, representing your real workloads.

If you can't afford to do that, look at a lot of benchmarks at once, e.g. on artificialanalysis.com, which merges multiple benchmarks across weighted categories into an Intelligence Score, a Coding Score, and an Agentic Score.
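The "weighted categories" idea above is just a weighted average of per-benchmark scores. A minimal sketch, noting that the benchmark names, scores, and weights below are made up for illustration and are not Artificial Analysis's actual methodology:

```python
# Hypothetical composite score: weighted average over benchmark categories.
# Category names, example scores, and weights are all assumptions here.

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Illustrative, invented numbers only.
scores = {"reasoning": 62.0, "coding": 55.0, "agentic": 48.0}
weights = {"reasoning": 0.4, "coding": 0.35, "agentic": 0.25}

print(round(composite_score(scores, weights), 2))  # 56.05
```

The point of normalizing by the weight sum is that you can tweak individual weights without having to keep them summing to exactly 1.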

cbg0 today at 4:54 PM

None. Try them out with your own typical tasks to see the performance.

WarmWash today at 3:33 PM

ARC-AGI 2

GLM 5 scores 5% on the semi-private set, compared to SOTA models, which hover around 80%.