Hacker News

We're running out of benchmarks to upper bound AI capabilities

14 points by gmays today at 8:16 PM | 3 comments

Comments

nikisweeting today at 9:48 PM

We can definitely make harder evals. The problem is that a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.

WarmWash today at 9:17 PM

Start front-loading the models with 5k, 10k, 50k, 100k tokens of messy, quasi-related context, and then run the benchmarks.

These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
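A rough sketch of what that front-loading could look like, in case it helps make the idea concrete. Everything here is an assumption for illustration: the names `front_load` and `approx_tokens` are made up, token counts are approximated by whitespace-separated words rather than a real tokenizer, and the actual model call and scoring are left out entirely.

```python
import random

def approx_tokens(text: str) -> int:
    # Crude proxy (assumption): count whitespace-separated words,
    # not tokens from a real tokenizer.
    return len(text.split())

def front_load(question, distractors, budget, seed=0):
    """Prepend shuffled distractor snippets until roughly `budget`
    approximate tokens of context sit in front of the question.

    Returns (prompt, approx_context_tokens)."""
    rng = random.Random(seed)
    # Drop empty snippets so every appended snippet adds at least one token.
    pool = [d for d in distractors if d.split()]
    rng.shuffle(pool)
    context, used, i = [], 0, 0
    while pool and used < budget:
        snippet = pool[i % len(pool)]  # cycle through the pool if it runs short
        context.append(snippet)
        used += approx_tokens(snippet)
        i += 1
    return "\n\n".join(context + [question]), used

if __name__ == "__main__":
    # Hypothetical stand-ins for messy, quasi-related context.
    notes = [
        "Meeting notes: the migration slipped a week, owner still unclear.",
        "Old README fragment describing a config flag that was later removed.",
        "Slack export: someone asking whether staging is down again.",
    ]
    # 0 is a blank-slate baseline; the rest are the sizes from the comment.
    for budget in (0, 5_000, 10_000, 50_000, 100_000):
        prompt, used = front_load("What is 17 * 23?", notes, budget)
        print(budget, used, len(prompt))
```

You'd then run the same benchmark questions at each budget and compare scores against the blank-slate (budget 0) run.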
