Hacker News

zamadatix | yesterday at 11:59 PM

It'd certainly be odd if people were recommending old LLMs which score worse, even if marginally. That said, 4o is really a lot more usable than you're making it out to be.

The particular benchmark in the example is fungible, but you have to pick something to make a representative example. No matter which you pick, someone always has a reason: "oh, it's not THAT benchmark you should look at". The benchmarks from the charts in the post exhibit the same pattern described above.

If someone were making new LLMs which were consistently solving Erdős problems at rapidly increasing rates, then they'd be showing how the models do that rather than showing how they score the same or slightly better on benchmarks. Instead the progress is more like: years after we were surprised LLMs could write poetry, we can now massage an answer out of one for a single Erdős problem. Maybe by the end of the year, a few. Progress has definitely become very linear and relatively flat compared to the period around the initial 4o release. I'm just hoping that's a temporary thing rather than a sign it'll get even flatter.


Replies

nl | today at 3:51 AM

Progress has not become linear. We've just hit the limits of what we can measure and explain easily.

One year ago coding agents could barely do decent auto-complete.

Now they can write whole applications.

That's much more difficult to show than an Elo score based on how much people like emojis and bold text in their chat responses.
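
To make that concrete, here's a minimal sketch of how a single pairwise preference vote moves an Elo-style rating (this assumes the textbook online Elo update with an illustrative K-factor of 32; LMArena actually fits a Bradley-Terry model over all votes, but the intuition is the same). A vote cast purely on formatting taste shifts the score exactly like a vote on correctness:

    # Sketch only: standard online Elo update, not LMArena's exact method.
    def expected(r_a, r_b):
        # Modeled probability that model A is preferred over model B.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update(r_a, r_b, a_won, k=32.0):
        # a_won: 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
        delta = k * (a_won - expected(r_a, r_b))
        return r_a + delta, r_b - delta

    # One vote driven purely by nicer Markdown still moves the rating:
    print(update(1200.0, 1200.0, a_won=1.0))  # (1216.0, 1184.0)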

Don't forget Llama 4 led LMArena and turned out to be very weak.

refulgentis | today at 12:17 AM

Frankly, this reads as a lot of words that amount to an excuse for using only LMArena, and the rationale is quite clear: it's for an unrelated argument that isn't going to ring true to people, especially an audience of programmers who just spent the last year watching AI go from barely making coherent file edits to doing multi-hour work.

LMArena is, de facto, a sycophancy and Markdown usage detector.

Two others you can trust, off the top of my head, are LiveBench.ai and Artificial Analysis. Or even Humanity's Last Exam results. (Though, frankly, I'm a bit suspicious of those; can't put my finger on why. It was just a rather rapid hill climb for a private benchmark over the last year.)

FWIW, GPT-5.2's unofficial marketing includes the Erdős thing you say isn't happening.
