cg5280 • yesterday at 4:38 PM

I like the inclusion of the graph at the end to compare progress. It would be cool to compare this directly to competing models (Claude, GPT, etc).

Replies

kqr • yesterday at 4:50 PM

It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance!

(I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)
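To illustrate the point about single runs, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the two "models" are simulated with the same hypothetical true pass rate of 0.6 on a hypothetical 20-task benchmark, and `run_benchmark` and `mean_and_stderr` are made-up helper names. None of the numbers come from the article. It only shows how one run per model can differ by chance, while repeated runs give a mean with a standard error.

```python
import random
import statistics


def run_benchmark(true_pass_rate, n_tasks, rng):
    """Simulate one benchmark run and return the fraction of tasks passed."""
    passed = sum(rng.random() < true_pass_rate for _ in range(n_tasks))
    return passed / n_tasks


def mean_and_stderr(scores):
    """Return the sample mean and the standard error of that mean."""
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, stderr


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Hypothetical setup: both models have the SAME true pass rate.
TRUE_RATE = 0.6
N_TASKS = 20
N_RUNS = 30

# One run each: any gap here is pure sampling noise.
single_a = run_benchmark(TRUE_RATE, N_TASKS, rng)
single_b = run_benchmark(TRUE_RATE, N_TASKS, rng)
print("single run gap:", abs(single_a - single_b))

# Many runs each: the means settle and the standard error quantifies the noise.
runs_a = [run_benchmark(TRUE_RATE, N_TASKS, rng) for _ in range(N_RUNS)]
runs_b = [run_benchmark(TRUE_RATE, N_TASKS, rng) for _ in range(N_RUNS)]
mean_a, stderr_a = mean_and_stderr(runs_a)
mean_b, stderr_b = mean_and_stderr(runs_b)
print("model A:", mean_a, "+/-", stderr_a)
print("model B:", mean_b, "+/-", stderr_b)
```

With real models, the simulated runs would be replaced by actual repeated benchmark runs; a gap between two models is only convincing if it is large compared with the standard errors.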

