Hacker News

cpard · today at 5:43 PM · 3 replies

Benchmarks/evals are really hard, and they become even harder when there's a huge incentive to game them at industry scale.

ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago.

A few days ago, a follow-up paper from a group that includes one of the original authors audited the benchmark itself. The team found that the benchmark has structural issues that biased its results.

Here’s the paper: https://arxiv.org/abs/2603.29399

None of this is new, though; the industry has gone through all of it before, just at a smaller scale, and there's a lot to learn from that. Here's a post I wrote on the parallels between what we see today and the benchmarketing wars of the database systems era.

https://www.typedef.ai/blog/from-benchmarketing-to-benchmaxx...


Replies

softwaredoug · today at 6:23 PM

It’s just hard to keep them out of the training data. We see this a bit with BrowseComp Plus and other deep-research datasets. Not because frontier labs are trying to cheat, but simply because they train on the full web.

You need new datasets perpetually.

fnordpiglet · today at 6:08 PM

Database benchmarks are another.

I do have empirical experience, though, building classifiers whose precision can't really be measured because the classifier consistently performs better than the humans judging it. They become the state-of-the-art benchmark themselves and can't be benchmarked except against themselves. These are non-trivial, complex tasks, though ones that require less logic than coding and less sustained reasoning. There may come a day when there is no calibrated benchmark that is independent of the models it's measuring.
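A minimal sketch of that point in Python, with every number made up purely for illustration: once the model is more accurate than the human annotators whose labels serve as ground truth, the precision you measure against those labels is dominated by annotator error rather than by the model's actual mistakes.

    import random

    random.seed(0)

    # Hypothetical rates, chosen only to illustrate the effect above.
    N = 100_000
    POSITIVE_RATE = 0.5     # base rate of the positive class
    MODEL_ACCURACY = 0.97   # assumed: model labels correctly 97% of the time
    HUMAN_ACCURACY = 0.90   # assumed: annotators label correctly 90% of the time

    true_labels = [random.random() < POSITIVE_RATE for _ in range(N)]
    model_labels = [t if random.random() < MODEL_ACCURACY else not t for t in true_labels]
    human_labels = [t if random.random() < HUMAN_ACCURACY else not t for t in true_labels]

    def precision(pred, ref):
        # fraction of predicted positives that the reference also calls positive
        tp = sum(p and r for p, r in zip(pred, ref))
        fp = sum(p and not r for p, r in zip(pred, ref))
        return tp / (tp + fp)

    print("precision vs. reality      :", round(precision(model_labels, true_labels), 3))
    print("precision vs. human labels :", round(precision(model_labels, human_labels), 3))

With these made-up rates the model's true precision comes out around 0.97, but measured against the human labels it looks like roughly 0.88, close to the annotators' accuracy, and it stays capped there no matter how much better the model gets.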

operatingthetan · today at 6:28 PM

Would creating new benchmarks every month solve this problem?
