What is the reason behind OpenAI being able to release new models very fast?
Since February, when we got Gemini 3.1, Opus 4.6, and GPT-5.3-Codex, we have seen GPT-5.4 and GPT-5.5, but only Opus 4.7 and no new Gemini model.
Both of these are pretty decent improvements.
"our strongest set of safeguards to date"
How much capability is lost by hobbling models with a zillion protections against idiots?
Every prompt gets evaluated to ensure you are not a hacker, you are not suicidal, you are not a racist, you are not...
Maybe just...leave that all off? I know, I know, individual responsibility no longer exists, but I can dream.
So, according to the benchmarks, somewhere in between Opus 4.7 and Mythos.
"Sometime with GPT-5.5 I become lazy"
I don't want to be lazy.
If SWE-Bench Verified is no longer a good measure of agentic coding abilities, what benchmark now is?
Is there anywhere I can try it? (I just stopped my Pro sub.) I was wondering if there's a playground or third party where I can test it briefly.
This is the first time OpenAI has included competing models in its benchmarks; previously it included only its own models.
For those using GPT-5.5: how does it compare to Opus 4.6 / 4.7 in terms of code generation?
Surprised to see SWE-Bench Pro only a slight improvement (57.7% -> 58.6%) while Opus 4.7 hit 64.3%. I wonder what Anthropic is doing to achieve higher scores here, and what makes this test particularly hard to do well on compared to Terminal Bench (where 5.5 seemed to have a big jump).
ctrl+f "cutoff, 0 results"
Surely it doesn't still have the same ancient data cutoff as 5.4 did?
Entering this comment section wondering if it will be full of complaints about the new personality, as with every single LLM update.
How does it compare to Mythos?
I just prompted GPT-5.5 Pro "Solve Nuclear Fusion" and it one shotted it (kidding obviously)
Which is better GPT-5.5 or Opus 4.7? And for what tasks?
> We are releasing GPT‑5.5 with our strongest set of safeguards to date
...
> we’re deploying stricter classifiers for potential cyber risk which some users may find annoying initially
So we should expect not to be able to check our own code for vulnerabilities, because the model inherently cannot know whether I'm feeding it my code or someone else's.
It's possible that "smarter" AI won't lead to more productivity in the economy. Why?
Because software and "information technology" generally didn't increase productivity over the past 30 years.
This has long been known as Solow's productivity paradox. There are lots of theories as to why this is observed, one of them being "mismeasurement" of productivity data.
But my favorite theory is that information technology is mostly entertainment, and rather than making you more productive, it distracts you and makes you more lazy.
AI's main application has been information space so far. If that continues, I doubt you will get more productivity from it.
If you give AI a body... well, maybe that changes.
I might just be following too many AI-related people on X, but omg the media blitz around 5.5 is aggressive.
Soo many unconvincing "I've had access for three weeks and omg it's amazing" takes, it actually primes me for it to be a "meh".
I prefer to see for myself, but the gradual rollout, combined with a full-on marketing campaign, is annoying.
Does it have cached-input pricing?
I hear it's as good as Opus 4.7.
The battle has just begun
Nice to see them openly compare to Opus-4.7… but they don't compare it against Mythos, which says everything you need to know.
The LinkedIn/X influencers who hyped this as a Mythos-class model should be ashamed of themselves, but they’ll be too busy posting slop content about how “GPT-5.5 changes everything”.
Literally cannot launch the Codex app anymore.
Good timing; I had just renewed my subscription.
> One engineer at NVIDIA who had early access to the model went as far as to say: "Losing access to GPT‑5.5 feels like I've had a limb amputated."
Everybody understands that you need to make money, but can you tone it down with the f*cking FOMO, please? It sounds just pathetic at this point:
"one engineer at NVIDIA", "limb amputated".
Put the cunt in a room and give me a handsaw, I want to see how fast he'll give up his arm over some cloud model.
I'd really like to see improvements like these:
- Some technical proof that data is never read by OpenAI.
- Proof that no logs of my data or derived data are saved.
- etc...
> A playable 3D dungeon arena
Where's the demo link?
... sigh. I realize there's little that can be done about this, but I just got through a real-world session determining whether Opus 4.7 is meaningfully better than Opus 4.6 or GPT-5.4, and now there's another one to try things with. These benchmark results generally mean little to me in practice.
Anyways, still exciting to see more improvements.
What do major and minor semver mean for these models? Is each minor release a new fine-tuning on a new subset of example data, while major releases are trained from scratch? Or do they even mean anything at this point?
GPT-5.4 is already an incredible model for code reviews and security audits with the swival.dev /audit command.
The fact that GPT-5.5 is apparently even better at long-running tasks is very exciting. I don’t have access to it yet, but I’m really looking forward to trying it.
Related and insightful: "GPT-5.5: Mythos-Like Hacking, Open to All" [1].
Is Codex receiving the 5.4 or 5.5 release?
I'm still using Codex 5.3 and haven't switched to GPT-5.4, as I don't like the "it's automatic, bro, trust us" approach, so I'm wondering whether Codex will get these specific releases at all in the future.
My impression has been that ChatGPT-5.4 has been getting dumber and more exhausting in the last couple of weeks. It makes a lot of obvious mistakes, ignores (parts of) prompts, and keeps forgetting important facts or requirements.
Maybe this is a crazy theory, but I sometimes feel like they gimp their existing models before a big release so you'll notice more of a "step".
I've stopped trusting these "trust me bro" benchmarks and just started going to LM Arena to look at the actual head-to-head comparisons.
I am sceptical. The generations after 4o have become crappier and crappier. I hope this one changes the trend. 5.4 is unusable for complex coding work.
I'm still using 5.3 in codex. Are 5.4 and 5.5 better than 5.3 in concrete ways?
Is this the first time OpenAI compared their new release to Anthropic models? Previously they were comparing only to GPT's own previous versions.
ARC-AGI 3 is missing from this list. Given that the SOTA before 5.5 was <1%, if I recall correctly, I wonder whether it made meaningful progress there.
Not rolled out to my Codex CLI yet, but some users on Reddit claim it's on theirs.
Next up: Google I/O on May 19?
I have to imagine they'll go to Gemini 3.5 if only for marketing reasons.
If anyone tried it already, how do you feel?
Numbers look too good; wondering whether it's benchmaxxed or not.
they are using ethical training weights this time!!! /j
Oh shiiiiit boy! An incrementation dropped!!
finally
Umm yeah but this is like every release in the last 3 years.
The big question is: does it still just write slop, or not?
Fool me once, fool me twice, fool me for the 32nd time, it’s probably still just slop.