Start front-loading the models with 5k, 10k, 50k, and 100k tokens of messy, quasi-related context, and then run the benchmarks.
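Rough sketch of what that could look like, assuming an OpenAI-style chat client; the corpus of distractor chunks, the eval item format, the token estimate, and the substring-match scoring are all placeholders for illustration:

```python
import random

CONTEXT_SIZES = [5_000, 10_000, 50_000, 100_000]  # approximate token budgets


def build_padding(corpus_chunks, token_budget):
    """Concatenate quasi-related chunks until we roughly hit the budget.
    Tokens are estimated as ~words / 0.75 (a crude heuristic, not a tokenizer)."""
    random.shuffle(corpus_chunks)
    padding, used = [], 0
    for chunk in corpus_chunks:
        est_tokens = int(len(chunk.split()) / 0.75)
        if used + est_tokens > token_budget:
            break
        padding.append(chunk)
        used += est_tokens
    return "\n\n".join(padding)


def run_padded_benchmark(client, model, eval_items, corpus_chunks):
    """Run the same eval at each padding level and record accuracy.
    eval_items is assumed to be a list of {"question": ..., "answer": ...} dicts."""
    results = {}
    for budget in CONTEXT_SIZES:
        padding = build_padding(list(corpus_chunks), budget)
        correct = 0
        for item in eval_items:
            # Front-load the distractor context, then ask the actual question.
            prompt = f"{padding}\n\n---\n\n{item['question']}"
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            # Placeholder scoring: naive substring match against the reference answer.
            correct += item["answer"].lower() in reply.lower()
        results[budget] = correct / len(eval_items)
    return results
```

Plot accuracy against the padding budget and you get a picture of how fast the model degrades as the context fills up with noise.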
These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
We can definitely make harder evals; the problem is that a good eval set is indistinguishable from good training data (or market edge), so no one is incentivized to share their best eval sets publicly.