Hacker News

twtw99 yesterday at 6:24 PM (7 replies)

If you don't want to click in, here's an easy comparison with the other two frontier models: https://x.com/OpenAI/status/2029620619743219811?s=20


Replies

bicx yesterday at 7:47 PM

That last benchmark seemed like an impressive leg up against Opus until I saw the sneaky footnote that it was actually a Sonnet result. Why even include it then, other than hoping people don't notice?

chabes yesterday at 6:26 PM

Definitely don’t want to click in at x either.

Aboutplants yesterday at 6:32 PM

It seems that all frontier models are roughly even at this point. One may be slightly better for certain things, but in general I think we are approaching a real level playing field in terms of ability.

swingboy yesterday at 6:35 PM

Why do so many people in the comments want 4o so bad?

MarcFrame yesterday at 7:20 PM

How does 5.4-thinking have a lower FrontierMath score than 5.4-pro?

karmasimida yesterday at 6:29 PM

It is a bigger model, confirmed

dom96 yesterday at 7:08 PM

Why do none of the benchmarks test for hallucinations?
