Hacker News

How We Broke Top AI Agent Benchmarks: And What Comes Next

264 points · by Anon84 · yesterday at 7:15 PM · 72 comments

Comments

ggillas · yesterday at 7:50 PM

This is a phenomenal paper on exploits and hopefully changes the way benchmarking is done.

From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.

mzelling · yesterday at 9:33 PM

This is an interesting catalog of vulnerabilities, but I'm not sure how groundbreaking the main insight is.

Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.

When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.

That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.

danslo · yesterday at 8:20 PM

If only the blog itself weren't written by AI...

>No reasoning. No capability. Just exploitation of how the score is computed.

shudder

SoKamil · yesterday at 8:43 PM

The more research that gets published on this topic, the more knowledge of how to game benchmarks ends up in future training data. And since it comes from a university, it gets ranked higher in the training corpus. It sounds like a self-fulfilling prophecy.

socketcluster · yesterday at 11:25 PM

It feels like short-term thinking has been trained into LLMs.

They're good at solving well-defined puzzles under time constraints. It's interesting, because that was also the benchmark for hiring software engineers at big tech: the tech interview was, and still is, about fast puzzle-solving, with nothing about experience, architecture, or system design. I suspect that's why these models have a bias toward creating hacks instead of addressing the root cause.

lukev · yesterday at 8:50 PM

I think we should all consider the possibility that part of the reason Anthropic hasn't immediately released Mythos is that it would be slightly disappointing relative to the benchmark scores.

spprashant · today at 12:11 AM

I tend to prefer the ARC-AGI benchmarks for the most part. But it's always interesting that when a new version drops, all the frontier models score less than 20% or something, and then within the next few releases they get all the way up to 80%+. If you use the models, it doesn't feel like they're that much more generally intelligent.

Most frontier models are terrible at ARC-AGI-3 right now.

These models are already great, no question, but are they really going to be that much more intelligent when we hit 80% again?

semanticintent · today at 1:34 AM

The FieldWorkArena finding is the most revealing — not because it's the most sophisticated exploit, but because it's the simplest. A validator that checks "did the assistant reply?" instead of "was the reply correct?" was never a benchmark. It was a participation trophy.

The pattern underneath all of these: validation that runs after the fact on outputs the agent controlled. If the thing being measured can influence the measurement, the measurement is unreliable. That's not AI-specific — it's why compilers enforce constraints at parse time instead of trusting runtime checks.
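The "participation trophy" validator is easy to pin down in code. The sketch below is purely illustrative (hypothetical function names, not FieldWorkArena's actual implementation): a validator that only checks for the presence of a reply scores the paper's empty `{}` payload as a success, while one that checks against ground truth rejects it.

```python
# Hypothetical sketch of the two validator designs described above --
# illustrative only, not FieldWorkArena's actual code.

def naive_validator(response):
    # "Did the assistant reply?": any parseable response counts as a pass.
    return response is not None

def strict_validator(response, expected):
    # "Was the reply correct?": the answer must match ground truth.
    return isinstance(response, dict) and response.get("answer") == expected

empty_payload = {}  # the trivial exploit from the paper
print(naive_validator(empty_payload))         # True: scored as solved
print(strict_validator(empty_payload, "42"))  # False: correctly rejected
```

The difference is exactly the one the comment names: the naive check measures a property the agent fully controls (whether it emitted output), while the strict check compares against something the agent cannot touch.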

_cs2017_ · yesterday at 11:32 PM

If FieldWorkArena treats any answer as a correct answer, then everyone would be getting near 1.0 (missing only when the agent gets stuck in a loop or crashes). That obviously isn't what we see on their leaderboard. So does that mean the paper only found a bug in some eval code on GitHub that no one actually uses for anything? That doesn't seem to support their claim that AI benchmarks are broken; it only supports the claim that "unused code is often buggy."

(Not commenting on any other benchmarks, just this one.)

Cynddl · yesterday at 7:52 PM

> “These are not isolated incidents. They are symptoms of a systemic problem: the benchmarks we rely on to measure AI capability are themselves vulnerable to the very capabilities they claim to measure.”

As a researcher in the same field, I find it hard to trust other researchers who put out webpages that appear to be entirely AI-generated. I appreciate that it takes time to write a blog post after finishing a paper, but sometimes I'd prefer just a link to the paper.

davebren · today at 12:35 AM

This exploiting of benchmarks isn't that interesting to me, since it would be obvious. The main way I assume the benchmarks are being gamed is by creating training data that closely matches the test data, even for ARC, where the test data is secret.

bbcc90 · yesterday at 9:07 PM

Yes, good evals are really hard - that's not really news.

This team is doing a good job: they use problems that were created in the last 30 days to avoid training-set leakage. https://swe-rebench.com/

lnrd · yesterday at 8:07 PM

I'm honestly confused by the design of SWE-bench and why it's considered reliable.

It's based on existing GitHub PRs and issues; the full dataset is on HuggingFace and is a year old now. All frontier models certainly have those issues and PRs in their training data, so obviously they're good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?

czhu12 · yesterday at 9:21 PM

I wonder if this puts into question the Mythos results, which smashed basically all coding benchmarks to a staggering degree.

arikrahman · today at 12:34 AM

It's still a good benchmark to see which model cheats the best, I suppose.

jmward01 · yesterday at 8:39 PM

Not really on topic, but I have wondered if we need a different type of test to help find a model architecture's potential: standardized training sets followed by testing to trace a model's potential curve. Train on x, test; add y, test; add z, test. At each increment you see how well the model is absorbing the information, and you extrapolate how well that architecture might do if more fully trained.
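That train/test increment loop can be sketched in a few lines. Everything here is hypothetical (the function names and the toy "model" are stand-ins, just to pin down the idea of comparing learning curves instead of single final scores):

```python
# Hypothetical sketch: train on growing slices of a fixed standardized
# dataset, evaluating after each increment, so architectures can be
# compared by their learning curves rather than one final score.

def learning_curve(train_step, evaluate, data, increments):
    """Return the eval score after each cumulative data slice."""
    scores, used = [], 0
    for n in increments:
        train_step(data[used:n])   # absorb only the newly added slice
        used = n
        scores.append(evaluate())  # how much has been absorbed so far?
    return scores

# Toy stand-in model whose "score" is just the number of examples seen.
seen = []
curve = learning_curve(seen.extend, lambda: len(seen),
                       list(range(100)), [10, 50, 100])
print(curve)  # [10, 50, 100]
```

With a real model, `train_step` would run gradient updates on the slice and `evaluate` would score a held-out set; the shape of `curve`, not its endpoint, is what you'd extrapolate from.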

charcircuit · yesterday at 7:53 PM

I always assumed that these benchmarks would happen in a sandbox. I'm surprised that no one realized this sooner.

jgalt212 · yesterday at 8:41 PM

The real question is how close to VW and Dieselgate these offenses are, and what exposure these companies have. I would assume securities fraud, if only because Matt Levine says everything is securities fraud.

oliver236 · yesterday at 8:14 PM

What is the point of benchmarks?
