I was very skeptical about Codex at the beginning, but now all my coding tasks start with it. It's not perfect at everything, but overall it's pretty amazing: refactoring, building something new, building something I'm not familiar with. It's still not great at debugging things, though.
One surprising thing Codex helped with is procrastination. I'm sure many people know the feeling: you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always a good starting point that you can quickly iterate on.
I’ve been using Codex CLI heavily after moving off Claude Code and built a containerized starter to run Codex in different modes: timers/file triggers, API calls, or interactive/single-run CLI. A few others are already using it for agentic workflows. If you want to run Codex securely (or not) in a container to test the model or build workflows, check out https://github.com/DeepBlueDynamics/codex-container.
It ships with 300+ MCP tools (crawl, Google search, Gmail/GCal/GDrive, Slack, scheduling, web indexing, embeddings, transcription, and more). Many came from tools I originally built for Claude Desktop; OpenAI's MCP support has been stable across 20+ versions, so I prefer it.
I will note I usually run this in danger mode, but because it runs in a container it doesn't have access to env vars I don't want it messing with, and I keep it in a directory I'm OK with it changing or poking about in.
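For anyone curious what that isolation pattern looks like, here's a minimal sketch (not the repo's actual setup; the image, mount path, and prompt are illustrative, and the danger flag name is current as of this writing):

    # Throwaway container: only whitelisted env vars and one mounted
    # directory are visible, so "danger mode" can't touch the rest of the host.
    docker run --rm -it \
      -e OPENAI_API_KEY \
      -v "$PWD/scratch:/workspace" \
      -w /workspace \
      node:22-slim \
      sh -c 'npm install -g @openai/codex && \
             codex exec --dangerously-bypass-approvals-and-sandbox \
               "audit this repo and fix the failing tests"'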
Headless browser setup for the crawl tools: https://github.com/DeepBlueDynamics/gnosis-crawl.
My email is in my profile if anyone needs help.
The GPT models, in my experience, have been much better for backend than the Claude models. They're much slower, but they produce clearer logic and more maintainable code. A pattern I use (sketched below): set up a GitHub issue with Claude in plan mode, then have Codex execute it, then come back to Claude to run custom code-review plugins. Then, of course, review it with my own eyes before merging the PR.
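A rough sketch of that loop, assuming the gh, claude, and codex CLIs are installed; the issue number, prompts, and review instruction are all made up for illustration:

    # 1. Pull the ticket body that was written up with Claude in plan mode.
    gh issue view 42 --json body --jq '.body' > plan.md

    # 2. Have Codex execute the plan non-interactively.
    codex exec "Implement the plan in plan.md and run the test suite."

    # 3. Back to Claude for an automated review pass before the human one.
    claude -p "Review the uncommitted changes against plan.md and flag regressions."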
My only gripe is I wish they'd publish Codex CLI updates to Homebrew at the same time as npm :)
The cybersecurity angle is interesting, because in my experience OpenAI stuff has gotten terrible at cybersecurity because it simply refuses to do anything that can be remotely offensive (as in the opposite of "defensive"). I really thought we as an industry had learned our lesson that blocking "good guys" (aka white-hats) from offensive tools/capabilities only empowers the gray-hat/black-hats and puts us at a disadvantage. A good defense requires some offense. I sure hope they change that.
It's interesting that they're foregrounding "cyber" stuff (basically: applied software security testing) this way, but I think we've already crossed a threshold of utility for security work that doesn't require models to advance to make a dent --- and won't be responsive to "responsible use" controls. Zero-shotting is a fun stunt, but in the real world what you need is just hypothesis identification (something the last few generations of models are fine at) and then quick building of tooling.
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
Fascinating to see the increasing acceptance of AI generated code in HN comments.
We've come a long way since GPT-3.5, and it's rewarding to see people who are willing to change their cached responses.
Somehow Codex for me is always way worse than the base models.
Especially in the CLI, it's so eager to start writing code that nothing can stop it, not even the best AGENTS.md.
Asking it a question or telling it to check something doesn't mean it should start editing code; it means it should answer the question. All models have this issue to some degree, but Codex is the worst offender for me.
> In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety.
Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (e.g. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (e.g. our gpt666-pro-ultra-krypto-sec found a CVE in an OpenBSD stable release), while not being exposed to tabloid-style headlines like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
Can anyone elaborate on what they're referring to here?
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
Codex code review has been astounding for my distributed team of devs. Very well spent money.
GPT 5.1 has been pure magic in VSCode via the Codex plugin. I can't tell any difference with 5.2 yet. I hope the Codex plugin gets feature parity with CC, Cursor, Kilo Code etc soon. That should increase performance a bit more through scaffolding.
I had assumed OpenAI was irrelevant, but 5.1 has been so much better than Gemini.
Would love to see some comparison numbers against Gemini and Claude, especially given this claim:
"The most advanced agentic coding model for professional software engineers"
lol I love how OpenAI just straight up doesn't compare their model to others on these release pages. Basically telling us they know Gemini and Opus are better but they don't want to draw attention to it
We have made this model even better at programming on Windows. Give it a shot :)
Recently I've had the best results with Gemini; with this I'll have to go back to Codex for my next project. It takes time to get a feel for the capabilities of a model, and it's sort of tedious having new ones come out so frequently.
It has become very quickly unfashionable for people to say they like the Codex CLI. I still enjoy working with it, and my only complaint is that its speed makes it less than ideal for pair coding.
On top of that, the Codex CLI team is responsive on GitHub, and it's clear that user complaints make their way to the team responsible for fine-tuning these models.
I run bake-offs between all three models, and GPT 5.2 generally has a higher success rate at implementing features, followed closely by Opus 4.5 and then Gemini 3, which has trouble with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.
I've been doing some reverse engineering recently and have found Gemini 3 Pro to be the best model for that, surprisingly much better than Opus 4.5. Maybe it's time to give Codex a try
Why aren’t they making gpt-5.2-codex available in the API at launch?
My only concern with Codex is that it's not possible to delete tasks.
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
> <PLACEHOLDER FOR FRONTEND HTML ASSETS>
> [ADD/LINK TO ROLLOUT THAT DISCOVERED VULNERABILITY]
What’s up with these in the article?
Thank gosh we have so much bloody competition.
The models are so good, unbelievably good. And getting better weekly, pricing included.
GPT 5.2 has been very good in Codex; can't wait to try this new model. We'll see how it compares to Opus 4.5.
> For example, just last week, a security researcher using GPT‑5.1-Codex-Max with Codex CLI found and responsibly disclosed (opens in a new window) a vulnerability in React that could lead to source code exposure.
Translation: "Hey y'all! Get ready for a tsunami of AI-generated CVEs!"
The models aren't smart enough to be fully agentic. This is why Claude Code's human-in-the-loop process is 100x more ergonomic.
In all my unpublished tests, which focus on (1) unique logic puzzles that are intentionally adjacent to existing puzzles and (2) implementing a specific, uncommon CRDT algorithm that has an official reference implementation on GitHub (so the models have definitely been trained on it), I find that 5.2 overfits to the more common implementation and will actively break working code and puzzles.
It pattern-matches incorrectly with a very narrow focus and ignores real, documented differences even when they're explicitly highlighted in the prompt text ("this is CRDT algorithm X, not Y").
I've canceled my subscription; the way it will, on any larger edit, just start wrecking nuance and then refuse to accept prompts that point this out is an extremely dangerous form of target fixation.
I hope this makes a big jump forward for them. I used to be a heavy Codex user, but it has just been so much worse than Claude Code both in UX and in actual results that I've completely given up on it. Anthropic needs a real competitor to keep them motivated and they just don't have one right now, so I'd really like to see OpenAI get back in the game.
FWIW, I had some well-defined Jira tickets assigned to me, and 5.2 absolutely crushed them. Still waiting on CI, but it's game over.
Very minuscule improvement. I suspect GPT 5.2 is already a coding model from the ground up, and this Codex model just adds various optimizations and tooling on top.
They found one React bug and spent pages on "frontier" "cyber" nonsense. They make these truly marvelous models only available to "vetted" "security professionals".
I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.
EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure in OpenAI.
So, uh, I've been being an idiot and running it in yolo mode, and twice now it's gone and deleted the entire project directory, wiping out all of my work. Thankfully I have backups, and it's my fault for playing with fire, but yeesh.
I have https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8... as a guard against that, for anyone who, like me, is stupid enough to run it in yolo mode and wants to copy it.
Codex also has command-line options that let you specifically prohibit running rm in bash, so look those up too.
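If you just want the flavor of the guard without clicking through, here's a minimal version of the idea as a shell function that shadows rm (my own sketch, not the contents of the gist, which is more thorough):

    # Refuse rm on the project root, $HOME, or /; everything else passes through.
    rm() {
      local arg
      for arg in "$@"; do
        case "$(realpath -m -- "$arg" 2>/dev/null)" in
          "$PWD"|"$HOME"|/)
            echo "rm blocked: refusing to delete '$arg'" >&2
            return 1
            ;;
        esac
      done
      command rm "$@"
    }
    # export -f rm   # bash-only: lets spawned subshells inherit the guard

Note it only protects shells that actually source it (e.g. via your bashrc), so it's a seatbelt, not a sandbox.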
Gotta love only comparing the model to other OpenAI models. And just like yesterday's Gemini thread, the vibes in this thread are so astroturfed. I guess it makes sense for the frontier labs to want to win the hearts and minds of Silicon Valley.
Pathetic. They got people working a week before Christmas for this?
Devstral Small 2 Instruct running locally seems about as capable, with the upside that when it's wrong it's very obvious, instead of covering it in bullshit.
I actually have 0 enthusiasm for this model. When GPT 5 came out it was clearly the best model, but since Opus 4.5, GPT5.x just feels so slow. So, I am going to skip all `thinking` releases from OpenAI and check them again only if they come up with something that does not rely so much on thinking.
If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.