They said they used things like submitted a `conftest.py` - e.g. what would be considered very blata...

jmalicki • today at 12:41 AM • 2 replies • view on HN

They said they used things like submitted a `conftest.py` - e.g. what would be considered very blatant cheating, not just overfitting/benchmaxxing. Did you read the AI slop in the post?

This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.

Replies

davebren • today at 12:49 AM

I get it, but why would anyone trust what these companies say about their model performance anyway. Everyone can see for themselves how well they complete whatever tasks they're interested in.

alt Hacker News

Replies