Hacker News

Imnimo, today at 4:58 AM (1 reply)

So, if you look at the way the scoring works, 100% is the max. For each task, you get full credit if you solve it in a number of steps less than or equal to the baseline. If you take more steps, you get points off. But each task is scored independently, and you can't "make up" for solving one slowly by solving another quickly.

Like suppose there were only two tasks, each with a baseline of 100 steps. You come along and you solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as quickly as the baseline, but the other twice as slowly, those would balance out and you'd get full credit. Instead, your scores are 1.0 for the first task, and 0.25 (scoring is quadratic) for the second task, and your total benchmark score is a mere 0.625.
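
Here's a rough sketch of that arithmetic in Python. The exact formula is my assumption, inferred from the numbers above: the per-task score is the baseline-to-steps ratio, capped at 1 and then squared, and the total is the plain average over tasks. The names `task_score` and `benchmark_score` are made up for illustration and aren't from the benchmark's own code.

```python
def task_score(steps_taken: int, baseline_steps: int) -> float:
    """Assumed rule: full credit at or under the baseline, else the squared ratio."""
    # Cap the ratio at 1.0 so being faster than the baseline earns no bonus.
    ratio = min(1.0, baseline_steps / steps_taken)
    return ratio ** 2


def benchmark_score(results: list[tuple[int, int]]) -> float:
    """Average the independent per-task scores (assumed aggregation)."""
    scores = [task_score(steps, baseline) for steps, baseline in results]
    return sum(scores) / len(scores)


# The two-task example: (steps_taken, baseline_steps) for each task.
results = [(50, 100), (200, 100)]
print([task_score(s, b) for s, b in results])  # [1.0, 0.25]
print(benchmark_score(results))                # 0.625
```

Because of the cap, the 50-step solve is worth exactly as much as a 100-step solve, so there's nothing left over to offset the 200-step one.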


Replies

daveguy, today at 2:57 PM

The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.