827a • today at 6:42 PM • 11 replies

Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors. I suspect the benchmarks may also be saturated, or at least past their usefulness.

I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.

1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said it's coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less than $1000 IIRC).

2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.

3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.

4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.

Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.

Replies

brokencode • today at 6:52 PM

Anthropic’s story over the past year has been nothing but explosive growth that they can’t keep up with, but now they’re suddenly doomed? Seems pretty far fetched to me.

No idea why you’d say they have critically underinvested in product when Claude Code dominates and they’ve also released popular tools like Cowork and integrations for Microsoft products at an incredibly rapid pace.

Cost is becoming more of a factor, and no doubt they’ll work on that. There’s no reason to think they won’t be able to release cheaper models if they optimize for that rather than improving performance.

jonnycoder • today at 7:12 PM

No, no it's been pretty easy with software engineering. I work on two types of projects and it's very easy to ask claude for a plan, then have gpt 5.5 rip it to shreds and find legit issues, and vice versa. If both 5.5 and claude 4.8 can independently create a plan and both find no critical or high issues, then we will be at that point.

AussieWog93 • today at 9:34 PM

>Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors.

Yeah nah, the models' flaws are pretty obvious when you use them. And as a user, you can absolutely know when a flaw disappears or barrier is cleared.

chis • today at 6:58 PM

I think it's probably too soon to say. I certainly still feel that large coding tasks are getting better and better with each model. I'd guess lawyers, doctors, etc feel similarly.

It feels like the only way to push the limits of newer models is with really long context questions that require reasoning. Any short request will naturally just be within the distribution of all the recent models so there isn't a performance difference there.

I think the near future is looking like a bunch of business-critical tasks that scale infinitely with better reasoning, all being done on whatever the most advanced model is at a high cost. Trading stocks, running a business, looking for tax dodges, writing high-performance code. These are all things where there's a tangible return on each jump in reasoning.

andai • today at 8:28 PM

Tried using everything that isn't Claude and I keep switching back to Claude because even the smarter models give me uglier code, or miss common sense requirements. (And the dumber models give me code that doesn't work properly).

I keep trying to switch to something else but I keep coming back. (Typically after a few days of giving a new model an honest go, and finding myself constantly asking Sonnet to fix its output... Yes, even Sonnet wins on this front! They really do have some kind of special sauce.)

I'm not where most of their money comes from though, and I don't know how universal my experience is.

jsnell • today at 8:33 PM

I'm a bit confused about what point you're trying to make.

Because you seem to be saying that Anthropic not changing the price of Opus is bad, but then two of your positive examples are Gemini 3.5 Flash (which tripled the 3.1 Flash token prices) and GPT-5.5 (which doubled the GPT-5.4 price, and is slightly more expensive per token than Opus).

Is your argument actually that price hikes are good? That doesn't seem to fit with the general tenor of the message.

loeg • today at 7:03 PM

I thought 4.7 was noticeably better than 4.6.

dbgrman • today at 8:32 PM

That's a pretty cynical take.

> past the point of human ability to discern whether they are actually better or worse

This is a lack of imagination. If you use these models heavily enough, pretty soon you'll hit the edges of their capabilities. The smarter among us are collecting these problems into a personal benchmark and using that to judge model capability. I think this is the right approach, and dare I say, even better than generic benchmarks. To me, it matters less what the benchmark says, and more what my particular problems are.
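
For what it's worth, a minimal sketch of what such a personal benchmark could look like, assuming each model is just a callable from prompt text to answer text. Every name here (`Case`, `score_model`, `pass_rate`, the toy models) is hypothetical and not part of any real API; the toy models stand in for real model calls so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One problem from your own work plus a check for a passing answer."""
    name: str
    prompt: str
    check: Callable[[str], bool]


def score_model(model: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Run every case through the model and record pass/fail per case."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases passed; 0.0 for an empty benchmark."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical stand-ins for real model calls (assumption, not a real API).
def toy_model_a(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "not sure"


def toy_model_b(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "olleh"


# The cases are placeholders; in practice these would be the problems
# where you personally hit the edge of a model's capability.
CASES = [
    Case("arithmetic", "What is 2 + 2?", lambda out: out.strip() == "4"),
    Case("reverse", "Reverse the string 'hello'.", lambda out: "olleh" in out),
]

if __name__ == "__main__":
    for label, model in [("a", toy_model_a), ("b", toy_model_b)]:
        results = score_model(model, CASES)
        print(label, results, pass_rate(results))
```

Keeping the per-case results rather than only the aggregate makes it easy to see which specific barrier a new model cleared, which is the part a generic benchmark score hides.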

dyauspitr • today at 6:54 PM

The Chinese stuff is good enough for up to 80% of the frontier on most text tasks but they are significantly worse at code. They just don’t “get” what you’re asking for like Codex and Claude and require so many more iterations to get close to what you need.

BoorishBears • today at 8:36 PM

All signs point to Opus 4.7 being smaller than 4.6, so I'm not sure all this holds.

You realize gpt-5.5 is also double the price of gpt-5.4, which itself was a price increase too, right?

Labs are divorcing pricing from inference costs.

llmslave • today at 7:06 PM

anthropic is crushing it, this analysis is laughable. they are only constrained by GPUs
