Hacker News

esafak today at 3:12 PM, 4 replies

Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?


Replies

NitpickLawyer today at 4:58 PM

SWE-rebench should not be gameable: they collect new issues from live repos, so checking scores 1-2 months after a model's release gives you an idea of real, uncontaminated performance. But even that would be "benchmaxxable". That's an overloaded term that can mean many things, but the most vanilla interpretation is that with RL you can get a model to follow a certain task type pretty well, at the cost of it getting "stuck" on that task type, or "stubborn" when asked to do similar but sufficiently different tasks. For SWE-rebench that would be: "it fixes bugs in these types of repos, under this harness, but ask it to do something else in a repo and you might not get the same results." In a nutshell.

underlines today at 3:22 PM

Well: your own unleaked ones, representing your real workloads.

If you can't afford to do that, look at a lot of benchmarks at once, e.g. on artificialanalysis.com, which merges multiple benchmarks across weighted categories into an Intelligence Score, a Coding Score, and an Agentic Score.
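The "weighted categories" idea above is just a weighted average of per-benchmark scores. A minimal sketch, noting that the benchmark names, scores, and weights below are made up for illustration and are not Artificial Analysis's actual methodology:

```python
# Hypothetical composite score: weighted average over benchmark categories.
# Category names, example scores, and weights are all assumptions here.

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-category scores; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Illustrative, invented numbers only.
scores = {"reasoning": 62.0, "coding": 55.0, "agentic": 48.0}
weights = {"reasoning": 0.4, "coding": 0.35, "agentic": 0.25}

print(round(composite_score(scores, weights), 2))  # 56.05
```

The point of normalizing by the weight sum is that you can tweak individual weights without having to keep them summing to exactly 1.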

cbg0 today at 4:54 PM

None. Try them out with your own typical tasks to see the performance.

WarmWash today at 3:33 PM

ARC-AGI 2

GLM 5 scores 5% on the semi-private set, compared to SOTA models, which hover around 80%.