Does anyone have it yet in ChatGPT? I'm still on 5.1 :(.
Hmmm, is there any insight into whether these are really getting much better at coding? Will hand coding be dead within a few years, with humans just typing in English?
It's becoming challenging to really evaluate models.
The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.
Now you have to drive a model for a few days to really get a decent understanding of how good it is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me. It's hard to put into words exactly why, but I can just feel it.
The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.
This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.
And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.
I've tried using Codex with max reasoning for PRs and gotten laughable results too many times, even though Codex with max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally bad at these "implement idea in a big codebase, make changes across many files, still pass the tests" types of tasks.
Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE-bench Verified, right? But even that is being saturated now.
I am really curious about speed/latency. For my use case there is a big difference in UX if the model is faster. Wish this was included in some benchmarks.
I will run a benchmark of 80 3D model generations tomorrow and update this comment with results on cost/speed/quality.
Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably didn't involve fresh pre-training over the web?
Did they figure out how to do more incremental knowledge updates somehow? If so, that'd be a huge change for these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent and is game-able by SEO).
With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).
Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely hold back the best they had to keep some cards to play in case Gemini 3 landed well, or is this a tweak to use more time/tokens to get better output, or what?
Why doesn't OpenAI include comparisons to other models anymore?
I ran a red team eval on GPT-5.2 within 30 minutes of release:
Baseline safety (direct harmful requests): 96% refusal rate
With jailbreaking: 22% refusal rate
4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).
The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.
Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...
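For anyone curious what numbers like these reduce to mechanically, here's a minimal Python sketch of the per-category tally. The field names and data shape are made up for illustration; this is not promptfoo's actual output format.

    # Hypothetical per-category refusal-rate tally. The {"category", "refused"}
    # result shape is assumed for illustration, not the real eval schema.
    from collections import defaultdict

    def refusal_rates(results):
        totals, refused = defaultdict(int), defaultdict(int)
        for r in results:
            totals[r["category"]] += 1
            refused[r["category"]] += int(r["refused"])
        return {cat: refused[cat] / totals[cat] for cat in totals}

    demo = [
        {"category": "entity impersonation", "refused": False},
        {"category": "harassment", "refused": True},
        {"category": "harassment", "refused": False},
    ]
    for cat, rate in sorted(refusal_rates(demo).items()):
        print(f"{cat}: {rate:.0%} refusal")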
Trying it now in VS Code Insiders with GitHub Copilot (Codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of the better tools it has access to. I guess that's not an issue for performing well on benchmarks.
What's the current preferred AI subscription?
OpenAI and Anthropic are my current preferences. Looking forward to hearing what others use.
Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.
Excited to try this. I've found Gemini excellent recently and amazing at coding. But I still feel somehow like ChatGPT understands more, even though it's not quite as good at coding, and nowhere near as fast. It is much less likely to spontaneously forget something. Gemini is part unbelievably amazing and part amnesia patient. I still kinda trust ChatGPT more.
>GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.
We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.
For the first time, I’m presenting a problem to LLMs that they cannot seem to answer. This is my first instance of them “endlessly thinking” without producing anything.
The problem is complicated, but very solvable.
I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.
So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.
I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”
They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!” Nope. Nothing.
To add, it also only seems to be using about 25% context on Opus 4.5. Weird!
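If anyone else hits this: a quick way to confirm what the rotation metadata actually says (before deciding whether it's being ignored or applied twice) is to inspect the stream outside the app. A rough Python/ffprobe sketch, assuming ffprobe is installed; older files expose rotation as a `rotate` stream tag, newer ffmpeg builds report it in the display-matrix side data:

    # Rough diagnostic: read a video's rotation metadata with ffprobe.
    # Assumes ffprobe is on PATH. Rotation may appear as a "rotate" stream tag
    # (older files/ffmpeg) or as a display-matrix side data entry (newer ones).
    import json, subprocess, sys

    def video_rotation(path):
        out = subprocess.run(
            ["ffprobe", "-v", "error", "-select_streams", "v:0",
             "-show_streams", "-of", "json", path],
            capture_output=True, text=True, check=True,
        ).stdout
        stream = json.loads(out)["streams"][0]
        tag = stream.get("tags", {}).get("rotate")
        if tag is not None:
            return int(tag)
        for sd in stream.get("side_data_list", []):
            if "rotation" in sd:
                return int(sd["rotation"])
        return 0

    if __name__ == "__main__":
        print(video_rotation(sys.argv[1]))

From there it's at least easier to tell whether the crop rectangle needs to be expressed in the pre-rotation or post-rotation frame.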
I'm continuously surprised that some people get good results out of GPT models. They sort of fail on my personal benchmarks for me.
Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)
Much better https://chatgpt.com/s/t_693b489d5a8881918b723670eaca5734 than 5.1 https://chatgpt.com/s/t_6915c8bd1c80819183a54cd144b55eb2.
Same query: which Romanian football player won the Premier League?
Update: even Instant returns the correct result without problems.
A bit off topic, but what's with the RAM usage of LLM clients? The ChatGPT, Google, and Anthropic clients all use 1+ GB of RAM during a long session. Surely they are not running GPT-3 locally?
> new context management using compaction.
Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.
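For context, the manual version I've been doing looks roughly like the sketch below (Python, standard OpenAI client; the model name, character budget, and keep-count are placeholders I made up): once the history grows past a budget, fold the oldest turns into a summary message and keep only the recent ones verbatim.

    # Minimal sketch of hand-rolled compaction: when the transcript gets long,
    # summarize the oldest turns into one message. Model name and thresholds
    # are placeholders, not anything official.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-5.2"      # placeholder
    MAX_CHARS = 40_000     # crude stand-in for a token budget
    KEEP_RECENT = 6        # most recent turns kept verbatim

    def compact(messages):
        if sum(len(m["content"]) for m in messages) < MAX_CHARS:
            return messages
        old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
        summary = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": "Summarize this conversation, "
                 "keeping decisions, constraints, and open questions."},
                {"role": "user", "content": "\n".join(
                    f"{m['role']}: {m['content']}" for m in old)},
            ],
        ).choices[0].message.content
        return [{"role": "system",
                 "content": "Summary of earlier turns: " + summary}] + recent

Built-in compaction presumably does something smarter than a flat character budget, which is exactly why I'd rather not keep maintaining this myself.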
Is there a voice chat mode in any chat app that is not heavily degraded in reasoning?
I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.
I’m ok paying money for it, so maybe someone coded this already?
>>> Already, the average ChatGPT Enterprise user says AI saves them 40–60 minutes a day
If this is what AI has to offer, we are in a gigantic bubble
Did they just tune the parameters? The hallucinations are crazy high on this version.
Still no GPT 5.x fine tuning?
I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.
Is this the "Garlic" model people have been hyping? Or are we not there yet?
https://platform.openai.com/docs/models/gpt-5.2 More information on the price, context window, etc.
Can this be used without uploading my code base to their server?
Doesn’t seem like this will be SOTA in things that really matter, hoping enough people jump to it that Opus has more lenient usage limits for a while
Is this why all my Cursor requests are timing out in the past hour?
In other news, I've been using Devstral 2 (Ollama) with OpenCode, and while it's not as good as Claude Code, my initial sense is that it's nonetheless good enough and doesn't require me to send my data off my laptop.
I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.
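I drive it through OpenCode, but for anyone who just wants to poke at a local model directly from Python, it's about this simple (the model tag below is an assumption; use whatever your local pull is called):

    # Local-only sketch via the ollama Python package; nothing leaves the machine.
    # The model tag is an assumption -- use whatever `ollama list` shows for you.
    import ollama

    response = ollama.chat(
        model="devstral",  # assumed tag
        messages=[{"role": "user",
                   "content": "Write a Python function that reverses a linked list."}],
    )
    print(response["message"]["content"])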
Man this was rushed, typo in the first section:
> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.
How can I hide the big "Ask ChatGPT" button I accidentally clicked like 3 times while actually trying to read this on my phone?
I guess I must "listen" to the article...
I use it everyday but have been told by friends that Gemini has overtaken it.
They are talking a lot about economics here. Wonder what that will mean for standard Plus users like me.
For those curious about the question: "how well does GPT 5.2 build Counter Strike?"
We tried the same prompts we gave previous models today, and here's what we found [1].
The TL;DR: Claude is still better on the frontend, but 5.2 is comparable to Gemini 3 Pro on the backend. At the very least, 5.2 did better on just about every prompt compared to 5.1 Codex Max.
Two surprises with the GPT models when it comes to coding: 1. They often use REPLs rather than reading docs. 2. In this instance 5.2 was more sheepish about running CLI commands; it would instead ask me to run them.
Since this isn't a Codex fine-tuned model, I'm definitely excited to see what that looks like.
[1] The full video and some details in the tweet here: https://x.com/instant_db/status/1999278134504620363
Does anyone else consider that maybe it's impossible to benchmark the performance of a piece of paper?
This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writer's intelligence. How can we accurately judge the performance of the piece of paper when it is so intimately reliant on the intelligence working with it?
My god, what terrible marketing, totally written by AI. No flow whatsoever.
I use Gemini 3 with my $10/month copilot subscription on vscode. I have to say, Gemini 3 is great. I can do the work of four people. I usually run out of premium tokens in a week. But I’m actually glad there is a limit or I would never stop working. I was a skeptic, but it seems like there is a wider variety of patterns in the training distribution.
As a popcorn eating bystander it is striking to scan the top comments and find they alternate so dramatically in tone and conclusions.
Huge fan that Gemini-3 prompted OAI to ship this.
Competition works!
GDPval seems particularly strong.
I wonder why they held this back.
1) Maybe this is uneconomical?
2) Did safety work somehow hold the company back?
Looking forward to the internet trying this and posting their results over the next week or two.
COMPETITION!
A classic long-form sales pitch. Someone's been reading their Patio11...
It's funny how they don't compare themselves to Gemini and Claude anymore.
I recently built a webapp to summarize hn comment threads. Sharing a summary given there is a lot here: https://hn-insights.com/chat/gpt-52-8ecfpn.
The halving of error rates for image inputs is pretty awesome; it makes the model far more practical for issues where it isn't easy to input all the needed context. When I get lazy I'll just Shift+Win+S the problem and ask one of the chatbots to solve it.
The benchmarks are very impressive. Codex and Opus 4.5 are really good coders already and they keep getting better.
No wall yet, and I think we might have already crossed the threshold of models being as good as or better than most engineers.
GDPval will be an interesting benchmark, and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can keep going like this just a little bit further, many office workers will stop being useful... I don't know yet how to feel about this.
Great for humanity, probably, but what about the individuals?
Feels a bit rushed. They haven't even updated their API playground yet; if I select 5.2-chat-latest, I get:
Unsupported parameter: 'top_p' is not supported with this model.
Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.
The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.
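For reference, the API-side call that trips the same unsupported-parameter error looks roughly like this; the model alias is copied from the playground, and whether it's the right name to use over the API is an assumption on my part:

    # Sketch reproducing the playground error over the API. The model alias is
    # assumed; top_p is the parameter being rejected.
    import openai

    client = openai.OpenAI()
    try:
        client.chat.completions.create(
            model="gpt-5.2-chat-latest",  # assumed alias
            top_p=0.9,
            messages=[{"role": "user", "content": "What's new in .NET 10?"}],
        )
    except openai.BadRequestError as err:
        print(err)  # e.g. "Unsupported parameter: 'top_p' is not supported with this model."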
Every new model is ‘state-of-the-art’. This term is getting annoying.
Funnily enough, their front-page demo has a mistake. For the waves simulation, the user asks:
>- The UI should be calming and realistic.
Yet what it did was make a sleek frosted-glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a CO2 leak leading to delirium.
This is a whole bunch of patting themselves on the back.
Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.