smusamashah, last Wednesday at 5:40 PM

    We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%.

Replies

attentive, last Wednesday at 9:57 PM

"We wished to also evaluate Claude 3.7 Sonnet, but were unable to complete the experiments given rate limits with the Anthropic API."
