We evaluate several frontier models on PaperBench, finding that the best-performin...

smusamashah • last Wednesday at 5:40 PM • 1 reply

    We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.

Replies

attentive • last Wednesday at 9:57 PM

"We wished to also evaluate Claude 3.7 Sonnet, but were unable to complete the experiments given rate limits with the Anthropic API."
