Hacker News

PaperBench

100 points | by meetpateltech | yesterday at 5:06 PM | 25 comments

Comments

no_multitudes yesterday at 9:12 PM

Are there examples of the outputs the LLMs under test generated? I couldn't find any detailed ones in the paper or code.

The result here seems to be "Our Judge LLM gave another LLM a 21% grade for some code it generated", which is ... not qualitatively meaningful at all to me.

smusamashah yesterday at 5:40 PM

    We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.
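The quoted 21.0% is an average replication score, aggregated from a graded rubric. A minimal sketch of how such a score could be computed, assuming a PaperBench-style weighted rubric tree whose leaf criteria are judged pass/fail (the node structure, weights, and example values here are illustrative, not taken from the paper):

```python
def score(node):
    """Weighted replication score for a rubric node, in [0, 1].

    Leaves carry a judge's binary pass/fail; internal nodes score as
    the weight-normalized average of their children's scores.
    """
    children = node.get("children")
    if not children:                      # leaf: judge's pass/fail verdict
        return float(node["passed"])
    total_weight = sum(c["weight"] for c in children)
    return sum(c["weight"] * score(c) for c in children) / total_weight


# Hypothetical rubric for one paper (weights and outcomes made up).
rubric = {
    "children": [
        {"weight": 2, "passed": True},    # e.g. "code runs end to end"
        {"weight": 1, "passed": False},   # e.g. "reproduces main table"
        {"weight": 1, "children": [       # e.g. "ablations replicated"
            {"weight": 1, "passed": True},
            {"weight": 1, "passed": False},
        ]},
    ],
}

print(f"{score(rubric):.3f}")
```

Averaging this per-paper score across all papers in the benchmark would give an overall figure like the reported 21.0%.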
tetris11 today at 3:31 PM

Sounds like a good initiative, but not one that should be under the ownership of a for-profit company with a massive stake in the race.

amelius yesterday at 8:55 PM

One thing I'd be interested in is a UI for reading papers with AI assistance.

timabdulla yesterday at 7:54 PM

What were the human PhDs able to do after more than 48 hours of effort? Presumably given that these are top-level PhDs, the replication success rate would be close to 100%?

riku_iki yesterday at 8:53 PM

I don't get the idea of this benchmark: they ask models to produce code replicating the results of papers that already have code on GitHub?..

DrillShopper yesterday at 7:26 PM

PaperBench sounds like a benchmarking software package for recently released GPUs.

antonkar today at 12:49 AM

There is a planet-wide, eternal, 100%-safe AI solution that could also be a billion-dollar startup:

Put all the GPUs in cloud/s controlled by international scientists (now you can use your GPU on any device, you can earn money by renting it out when you don't need it, and nothing changes except you need to be online to use it, but we'll have 5G and better worldwide. You can develop, sell, or release free math-proven safe AI models in this cloud "AI App Store", etc).

Because the main risk is an AI-agent botnet: current GPUs are like nukes that are 100% unprotected. Any hacker can make a virus with an AI-agent component just to steal money; this AI will not be aligned at all and will become a perpetual and eventually autonomous botnet.