I don't get the idea of this benchmark: they ask agents to produce code that replicates the results of papers which already have code on GitHub?..
You don’t see the value of independent replication of findings?
The agent didn’t have access to the code. They acknowledge it could theoretically be in the training set, but even then the original code wouldn’t conform to the structure of the test.
The point is to bootstrap self-improving AI. Once a measurement becomes a goal, model makers target saturating it.
There is a coefficient of intelligence replication, i.e. a model M with intelligence I_m can reproduce a model N with intelligence I_n. When I_n / I_m > 1 we'll have a runaway intelligence explosion. There are of course several elements in the chain - akin to the Drake equation for intelligent machines - and their combined multiplicative effect determines the overall replication coefficient of the system. If f(paper) -> code is the weakest part of the chain, it makes sense to target that.
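A toy back-of-the-envelope sketch of that argument in Python (all stage names and numbers are made up for illustration, not taken from the benchmark): the replication coefficient r is the product of per-stage factors, intelligence compounds as I_{k+1} = r * I_k, and improving the weakest stage moves the product the most because the other stages have little headroom left.

```python
from math import prod

# Hypothetical per-stage multipliers in the replication chain
# (values are invented purely for illustration).
chain = {
    "paper -> code fidelity": 0.6,          # assumed weakest link, what a paper-replication eval targets
    "code -> successful training run": 0.9,
    "extra compute / scale per generation": 1.5,
}

def replication_coefficient(stages):
    """Overall r = I_n / I_m as the combined multiplicative effect of every stage."""
    return prod(stages.values())

def generations(i0, r, n):
    """Compound intelligence over n replication cycles: I_k = i0 * r**k."""
    return [round(i0 * r**k, 3) for k in range(n + 1)]

r = replication_coefficient(chain)
print(f"r = {r:.3f}")            # 0.810 < 1: each generation is weaker, no takeoff
print(generations(1.0, r, 5))

# Improve only the weakest stage; the near-saturated stages barely matter.
chain["paper -> code fidelity"] = 0.95
r2 = replication_coefficient(chain)
print(f"r = {r2:.3f}")           # 1.283 > 1: the runaway condition from the comment above
print(generations(1.0, r2, 5))
```

Under these made-up numbers, bumping paper -> code fidelity from 0.6 to 0.95 is what flips r across the critical threshold of 1, which is the sense in which targeting the weakest stage is the rational move.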