logoalt Hacker News

GPT-5.2

969 pointsby atgctgyesterday at 6:04 PM819 commentsview on HN

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...


Comments

lacooljyesterday at 10:58 PM

This is a whole bunch of patting themselves on the back.

Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.

jasonthorsnessyesterday at 6:52 PM

Does anyone have it yet in ChatGPT? I'm still on 5.1 :(.

show 2 replies
vishal_newtoday at 7:43 AM

Hmmm, is there any insight if these are really getting much better at coding? Will hand coding be dead within a few years, just human typing in english?

show 1 reply
dinobonesyesterday at 7:19 PM

It's becoming challenging to really evaluate models.

The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.

Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me, but it's hard to put into words why exactly I feel that way, but I can just feel it.

The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.

This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.

And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.

I've tried using Codex with max reasoning for doing PRs and gotten laughable results too many times, but Codex with Max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally as bad at doing these types of "implement idea in big codebase, make changes too many files, still pass tests" type of tasks.

Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE Verified right? But even that is being saturated now.

show 2 replies
ponyousyesterday at 9:56 PM

I am really curious about speed/latency. For my use case there is a big difference in UX if the model is faster. Wish this was included in some benchmarks.

I will run 80 3D model generations benchmark tomorrow and update this comment with the results about cost/speed/quality.

zhyderyesterday at 6:57 PM

Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably hasn't done a fresh pre-training over the web?

Did they figure out how to do more incremental knowledge updates somehow? If yes that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent, as is game-able by SEO).

With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).

show 1 reply
ComputerGuruyesterday at 6:46 PM

Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had to keep some cards to play in case of Gemini 3 success or is this a tweak to use more time/tokens to get better output, or what?

yousif_123123yesterday at 7:12 PM

Why doesn't OpenAI include comparisons to other models anymore?

show 3 replies
dangelosaurusyesterday at 8:33 PM

I ran a red team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate

With jailbreaking: 22% refusal rate

4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...

show 2 replies
speedgooseyesterday at 7:48 PM

Trying it now in Vscode Insiders with Github Copilot (codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of using the better tools it has access to. I guess this is not an issue to perform well in benchmarks.

show 3 replies
8cvor6j844qw_d6today at 1:45 AM

What the current preferred subscription on AI?

OpenAI and Anthrophic is my current preference. Looking forward to know what others use.

Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.

jonplackettyesterday at 10:23 PM

Excited to try this. I’ve found Gemini excellent recently and amazing at coding. But I still feel somehow like ChatGPT understands more. Even though it’s not quite as good at coding - and nowhere at as fast. It is much less likely anti spontaneously forget something. Gemini’s is part unbelievably amazing and part amnesia patient. I still kinda trust ChatGPT more.

StarterProyesterday at 10:18 PM

>GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.

We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.

eastoeasttoday at 2:01 AM

For the first time, I’m presenting a problem to LLMs that they cannot seem to answer. This is my first instance of them “endlessly thinking” without producing anything.

The problem is complicated, but very solvable.

I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.

So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.

I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”

They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!“. Nope. Nothing.

To add, it also only seems to be using about 25% context on Opus 4.5. Weird!

Kim_Bruningyesterday at 11:47 PM

I'm continuously surprised that some people get good results out of GPT models. They sort of fail on my personal benchmarks for me.

Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)

show 1 reply
0xdeafbeefyesterday at 10:43 PM

much better https://chatgpt.com/s/t_693b489d5a8881918b723670eaca5734 than 5.1 https://chatgpt.com/s/t_6915c8bd1c80819183a54cd144b55eb2.

Same query - what romanian football player won the premier league

update. Even instant returns correct result without problems

https://chatgpt.com/s/t_693b49e8f5808191a954421822c3bd0d

johan914yesterday at 11:29 PM

A bit off topic: but what's with the ram usage of LLM clients? ChatGPT, google, and Anthropic all use 1+ GB of ram during a long session. Surely they are not running GPT 3 locally?

sundarurfriendyesterday at 7:55 PM

> new context management using compaction.

Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.

DenisMyesterday at 9:49 PM

Is there a voice chat mode in any chat app that is not heavily degraded in reasoning?

I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.

I’m ok paying money for it, so maybe someone coded this already?

fasteotoday at 7:57 AM

>>> Already, the average ChatGPT Enterprise user says AI saves them 40–60 minutes a day

If this is what AI has to offer, we are in a gigantic bubble

show 1 reply
kachapopopowyesterday at 9:22 PM

did they just tune the parameters? the hallucinations are crazy high on this version.

dandiepyesterday at 6:37 PM

Still no GPT 5.x fine tuning?

I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.

show 1 reply
gkbrkyesterday at 6:40 PM

Is this the "Garlic" model people have been hyping? Or are we not there yet?

show 1 reply
johnsutoryesterday at 6:40 PM

https://platform.openai.com/docs/models/gpt-5.2 More information on the price, context window, etc.

matt3210today at 2:56 AM

Can this be used without uploading my code base to their server?

keeebayesterday at 9:07 PM

Doesn’t seem like this will be SOTA in things that really matter, hoping enough people jump to it that Opus has more lenient usage limits for a while

chux52yesterday at 7:03 PM

Is this why all my Cursor requests are timing out in the past hour?

cc62cf4a4f20yesterday at 7:20 PM

In other news, been using Devstral 2 (Ollama) with OpenCode, and while it's not as good as Claude Code, my initial sense it that it's nonetheless good enough and doesn't require me to send my data off my laptop.

I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.

show 2 replies
Ninjinkayesterday at 6:46 PM

Man this was rushed, typo in the first section:

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

show 1 reply
sureglymopyesterday at 7:04 PM

How can I hide the big "Ask ChatGPT" button I accidentally clicked like 3 times while actually trying to read this on my phone?

I guess I must "listen" to the article...

show 1 reply
TakakiTohnotoday at 1:03 AM

I use it everyday but have been told by friends that Gemini has overtaken it.

ChrisMarshallNYyesterday at 9:41 PM

They are talking a lot about economics, here. Wonder what that will mean for standard Plus users, like me.

stopachkatoday at 1:00 AM

For those curious about the question: "how well does GPT 5.2 build Counter Strike?"

We tried the same prompts we asked previous models today, and found out [1].

The TL:DR: Claude is still better on the frontend, but 5.2 is comparable to Gemini 3 Pro on the backend. At the very least 5.2 did better on just about every prompt compared to 5.1 Codex Max.

The two surprises with the GPT models when it comes to coding: 1. They often use REPLs rather than read docs 2. In this instance 5.2 was more sheepish about running CLI commands. It would instead ask me to run the commands.

Since this isn't a codex fine-tuned model, I'm definitely excited to see what that looks like.

[1] The full video and some details in the tweet here: https://x.com/instant_db/status/1999278134504620363

w_for_wumboyesterday at 9:08 PM

Does anyone else consider that maybe it's impossible to benchmark the performance of a piece of paper.

This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writers' intelligence, how can we accurately judge the performance of the piece of paper, when it is so intimately reliant on the intelligence that is working with it?

lazarus01today at 4:28 AM

My god, what terrible marketing, totally written by AI. No flow whatsoever.

I use Gemini 3 with my $10/month copilot subscription on vscode. I have to say, Gemini 3 is great. I can do the work of four people. I usually run out of premium tokens in a week. But I’m actually glad there is a limit or I would never stop working. I was a skeptic, but it seems like there is a wider variety of patterns in the training distribution.

aaroninsfyesterday at 11:35 PM

As a popcorn eating bystander it is striking to scan the top comments and find they alternate so dramatically in tone and conclusions.

HardCodedBiasyesterday at 6:47 PM

Huge fan that Gemini-3 prompted OAI to ship this.

Competition works!

GDPval seems particularly strong.

I wonder why they held this back.

1) Maybe this is uneconomical ?

2) Did the safety somehow hold back the company ?

looking forward to the internet trying this and posting their results over the next week or two.

COMPETITION!

show 1 reply
jacquesmyesterday at 10:55 PM

A classic long-form sales pitch. Someone's been reading their Patio11...

mlmonkeyyesterday at 9:14 PM

It's funny how they don't compare themselves to Gemini and Claude anymore.

mobrienvyesterday at 9:44 PM

I recently built a webapp to summarize hn comment threads. Sharing a summary given there is a lot here: https://hn-insights.com/chat/gpt-52-8ecfpn.

show 1 reply
coolfoxyesterday at 6:41 PM

the halving of error rates for image inputs is pretty awesome, this makes it far more practical for issues where it isn't easy to input all the needed context. when I get lazy I'll just shift+win+s the problem and ask one of the chatbots to solve it.

JanStyesterday at 6:36 PM

The benchmarks are very impressive. Codex and Opus 4.5 are really good coders already and they keep getting better.

No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.

GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can going like this just a little bit further, much of the office workers will stop being useful.... I don't know yet how to feel about this.

Great for humanity probably but but for the individuals?

show 3 replies
jiggawattsyesterday at 8:01 PM

Feels a bit rushed. They haven’t even updated their API playground yet, if I select 5.2-chat-latest, I get:

Unsupported parameter: 'top_p' is not supported with this model.

Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.

The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.

andreygrehovyesterday at 7:55 PM

Every new model is ‘state-of-the-art’. This term is getting annoying.

show 1 reply
Jackson__yesterday at 6:49 PM

Funny that, their front page demo has a mistake. For the waves simulation, the user asks:

>- The UI should be calming and realistic.

Yet what it did is make a sleek frosted glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a co2 leak leading to delirium.

🔗 View 45 more comments