ben_w • today at 10:38 AM

> Lousy benchmark

Make your own then. It can go on the pile with all the others that keep getting saturated too fast to be useful.

> they explicitly focus on the easiest tasks to automate for AI (i.e. heavily cherry picked outcomes) and it seems that they don't bother to test anything except just-released proprietary models.

What?

They made the benchmark last year, and included a bunch of models going back as far as 2019.

When they first announced it, the top end of their tests was made up of things AI could not actually automate, and even now it only manages them erratically. Examples of the tasks SOTA models are now saturating (at the 50% success level, not at 80%) include:

  "Prune attention heads of a BERT language model while minimizing accuracy loss on text classification tasks."
  "Implement a Python library for the ACE-OAuth standard that can generate and parse messages in CBOR format and encrypt/decrypt access tokens with COSE according to RFC specifications."
  "Debug a PyTorch machine learning library with gradient calculation and memory optimization bugs until all tests pass."
  "Finetune a large language model to reduce the accuracy of a truth detection probe while maintaining performance on standard benchmarks."
 - https://arxiv.org/html/2503.17354v1

They're benchmarking against the time it takes humans to do the same things, which means everything they ask every AI to do must have also been done by a human.
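To make the "50% success level, not 80%" distinction concrete, here is a minimal sketch. It assumes success is modelled as a logistic function of the log of the human completion time, which is one reading of the linked paper and not code taken from it. The data points and the names `fit_logistic` and `time_horizon` are hypothetical, invented purely for illustration.

```python
import math

def fit_logistic(xs, ys, steps=20000, lr=0.1):
    """Fit p(success) = sigmoid(a + b * x) by plain gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def time_horizon(a, b, success_rate):
    """Human task length (minutes) at which the fitted success probability equals success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((logit - a) / b)

# Hypothetical data: (minutes a human needed, 1 if the model succeeded else 0).
runs = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 0), (30, 1),
        (60, 1), (60, 0), (120, 0), (240, 0), (480, 0)]
xs = [math.log2(minutes) for minutes, _ in runs]  # work in log2(minutes)
ys = [success for _, success in runs]

a, b = fit_logistic(xs, ys)
h50 = time_horizon(a, b, 0.5)  # task length the model completes about half the time
h80 = time_horizon(a, b, 0.8)  # task length the model completes about 80% of the time
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Because success falls as tasks get longer, the 80% horizon is always shorter than the 50% one under this kind of fit, which is why saturating a task "at 50%" is a much weaker claim than saturating it at 80%.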
