Hacker News

gpt5 · today at 5:18 AM · 6 replies

I wish it were true. I would gladly use a GPT 5.2 high equivalent for coding (6 months old) if DeepSeek or Kimi offered it cheaper. And I'm sure that's an extremely common opinion among the millions of Claude and Codex users who are bothered by the costs.

However, they just don't perform that well in practice. That's the real issue, and you can see it when you move away from open benchmarks. DeepSeek 3.2 scores 4% on ARC-AGI-2 [1], while GPT 5.2 high scores 52% and GPT 5.5 pro high scores 84.6%. That's the real reason nobody is using these models for serious work. It's incredibly frustrating.

On top of that, I already feel the pain of the model restrictions myself. I ask my Codex 5.5 agent to crawl a website - BOOM, cybersecurity warning on my account. I ask it to fix SSH on my local network - another warning. I'm worried about the day my account gets randomly banned and I can't create a new one. OpenAI already asks you to complete full identity verification to eliminate these warnings - probably for exactly that reason: if they ban you, it's permanent.

[1] https://arcprize.org/leaderboard


Replies

usernametaken29 · today at 7:46 AM

I worked extensively on ARC-AGI before, and one thing is SURE as hell: OpenAI and Gemini in particular use it as marketing material. You can correlate benchmark releases with stock price increases. They feed synthetic ARC datasets into their models to boost the numbers. There is no doubt in my mind that Gemini is no better than DeepSeek beyond being specifically fine-tuned for ARC-AGI. Heck, they even admit it - they say they have paid annotations for ARC. Again, economic incentives. As for whether these models are actually better beyond the benchmarks: likely not. See ARC 3, where the gap is vanishingly small.

applfanboysbgon · today at 6:39 AM

> DeepSeek 3.2 scores 4% on ARC-AGI-2

Why are you comparing an outdated Chinese model from 6 months ago to a US model from 6 months ago? The Chinese models trail by about 6 months, so a 6-month-old one obviously reflects roughly 12-month-old performance. But today's Chinese model, DeepSeek 4, is not far from the US model of 6 months ago: 46% versus 52% for GPT 5.2.

sho · today at 5:44 AM

I 100% agree with you, but I've been convinced over the last year that it's a time and scale issue, not anything fundamental.

The Chinese models are in a weird spot right now. Compared to the frontier labs, both their pre- and post-training are woeful - tiny, resource-constrained in every dimension including human, and slow. I'd compare it to OpenAI 5 years ago, except I think even then OpenAI had way more!

But they "cheat" quite a lot with distillation and very benchmark-focused RL, and that's where you get this superficial quality on the leaderboards that doesn't hold up when you go off-script. ARC is a great example in that it really betrays the "inferior soul" at the heart of it all.

What gives me great hope though is that those same scaling laws that Altman and others have been hyping forever will absolutely kick in for the Chinese labs just as they did for the US ones, and I don't think anything can stop that process now. So they will catch up. It won't be tomorrow, but it's not going to be 10 years either. 3-5 would be my reasonably educated guess.

And the final risk, that China itself might try to restrict availability of the tsunami of GPU or other AI hardware it will inevitably produce - well, I just can't really imagine a country that has been configuring itself for the last 40 years as a single purpose export machine deciding that actually, no, it doesn't want to export something.

About the model restrictions - absolutely. I've been trying to do security research on my own software and the frontier models immediately get suspicious. I've been playing with the local ones much more this year basically because of this. They have deficiencies, for sure - they feel very "hollow" compared to the major labs. But I've talked to a lot of people, and the consensus is pretty clear - just a matter of time.

ageitgey · today at 6:36 AM

Have you tried the latest DeepSeek v4 Pro inside the Claude Code harness? It's not listed on that site.

It definitely 'feels like' it is as good as Claude for many regular web app coding tasks (though I don't have real benchmarks). And it is comically cheap.

I'm not suggesting it is better than the latest Claude or codex models, but it seems 'good enough' for a lot of use cases in my limited real world testing.

energy123 · today at 8:13 AM

They're not even that much cheaper (about half the price per task, according to Artificial Analysis) once you account for the lower token usage of GPT-5.5. I can't justify it when factoring in the extra time wasted and the cheap Codex usage I get through the monthly plan. Frontier intelligence is not a commodity product ... yet.
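For a back-of-envelope sense of why per-token price alone is misleading: cost per task is per-token price times tokens burned, so a model that's 4x cheaper per token but uses 2x the tokens only ends up 2x cheaper per task. All numbers below are made up for illustration, not Artificial Analysis figures:

```python
# Hypothetical illustration: per-task cost depends on both the
# per-token price AND how many tokens the model burns per task.
# Prices and token counts here are invented, not real pricing data.

def cost_per_task(price_per_mtok: float, tokens_per_task: int) -> float:
    """Cost of one task = price per million tokens x tokens used (in millions)."""
    return price_per_mtok * tokens_per_task / 1_000_000

# Frontier model: pricier per token, but terse.
frontier = cost_per_task(price_per_mtok=10.0, tokens_per_task=50_000)
# "Cheap" model: 1/4 the per-token price, but twice the tokens per task.
cheap = cost_per_task(price_per_mtok=2.5, tokens_per_task=100_000)

print(f"frontier: ${frontier:.2f}/task, cheap: ${cheap:.2f}/task")
print(f"cheap model costs {cheap / frontier:.0%} of frontier per task")
```

With these made-up inputs the nominal 75% per-token discount shrinks to a 50% per-task discount, which is the shape of the effect being described.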

otabdeveloper4 · today at 6:18 AM

And yet Claude six months ago was amazing and good enough for you.

This shows that cloud AI consumption is just conspicuous consumption - a status symbol. Nobody knows why they need cloud AI or what problem they're even solving.