Codex has always been better at following agents.md and prompts more, but I would say in the last 3 ...

inerte • yesterday at 10:39 PM • 17 replies

Codex has always been better at following agents.md and prompts, but I would say that in the last 3 months Claude Code got worse (freestyling like we see here) and Codex got EVEN more strict.

80% of the time I ask Claude Code a question, it kinda assumes I am asking because I disagree with something it said, then acts on that supposition. I've resorted to appending things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.

Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.

With this project I am doing, because I want to be more strict (it's a new programming language), Codex has been the perfect tool. I am mostly using Claude Code when I don't care so much about the end result, or it's a very, very small or very, very new project.

Replies

kace91 • yesterday at 10:48 PM

>I've resorted to appending things like "THIS IS JUST A QUESTION. DO NOT EDIT CODE. DO NOT RUN COMMANDS". Which is ridiculous.

Funny to read that, because for me it's not even new behavior. I have developed a tendency to add something like "(genuinely asking, do not take as a criticism)".

I'm from a more confrontational culture, so I just assumed this was just corporate American tone framing criticism softly, and me compensating for it.

onion2k • today at 5:52 AM

This is important, but take it as a warning. At least in theory your agent will follow everything that it has in context, but agents rely on 'context compaction' when the conversation gets close to the limit. This means an agent can and will drop your explicit instructions not to do things, and then happily do them because they're not in the context any more. You need to repeat important instructions.
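
A rough sketch of what "repeat important instructions" could look like if you control the harness yourself. Everything here is hypothetical (the `Conversation` class, the message-count limit standing in for a token limit, the toy summary step); it is not how any particular agent implements compaction. It only shows pinned instructions being kept outside the history that gets compacted and re-sent on every turn.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Toy conversation buffer (hypothetical, for illustration only)."""

    pinned: list[str]                      # instructions that must never be dropped
    history: list[str] = field(default_factory=list)
    max_messages: int = 6                  # stand-in for a real token limit

    def add(self, message: str) -> None:
        self.history.append(message)
        if len(self.history) > self.max_messages:
            self.compact()

    def compact(self) -> None:
        # Toy compaction: collapse older messages into a one-line summary
        # and keep only the most recent ones verbatim.
        keep = self.max_messages // 2
        dropped = len(self.history) - keep
        summary = f"[summary of {dropped} earlier messages]"
        self.history = [summary] + self.history[-keep:]

    def build_prompt(self) -> list[str]:
        # Pinned instructions are prepended on every turn, so compaction
        # of the history cannot remove them.
        return self.pinned + self.history
```

With `pinned=["DO NOT EDIT CODE."]`, the instruction is still at the top of `build_prompt()` after any number of compactions, while early messages survive only as a summary line.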

lubujackson • yesterday at 10:56 PM

I feel like people are sleeping on Cursor, no idea why more devs don't talk about it. It has a great "Ask" mode, the debugging mode has recently gotten more powerful, and its plan mode has started to look more like Claude Code's plans when I test them head to head.

AlotOfReading • yesterday at 11:18 PM

I've had some luck taming prompt introspection by spawning a critic agent that looks at the plan produced by the first agent and vetoes it if the plan doesn't match the user's intentions. LLMs are much better at identifying rule violations in a bit of external text than at regulating their own output. Same reason why they generate unnecessary comments no matter how many times you tell them not to.
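
A minimal sketch of the critic pattern described above, assuming a harness where the plan is available as text before anything runs. The names `review_plan` and `keyword_critic` are made up for illustration, and the keyword check is only a stand-in for a second model call; a real critic would be another LLM prompted with the user's request and the plan.

```python
from typing import Callable

# A critic takes (user_request, plan) and returns True if the plan may proceed.
Critic = Callable[[str, str], bool]

# Hypothetical list of words that signal the plan wants to act, not just answer.
ACTION_WORDS = {"edit", "run", "delete", "write"}


def keyword_critic(user_request: str, plan: str) -> bool:
    """Stand-in critic: veto plans that take actions when the user only asked a question."""
    is_question = user_request.strip().endswith("?")
    takes_action = bool(ACTION_WORDS & set(plan.lower().split()))
    return not (is_question and takes_action)


def review_plan(user_request: str, plan: str, critic: Critic) -> str:
    """Return the plan unchanged if the critic accepts it, otherwise a veto message."""
    if critic(user_request, plan):
        return plan
    return "VETOED: plan does not match the user's request"
```

The point of the split is that the critic only ever sees the plan as external text, which is the situation the comment says models handle better than policing their own output.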

0xbadcafebee • today at 3:08 AM

This is mostly dependent on the agent because the agent sets the system prompt. All coding agents include in the system prompt the instruction to write code, so the model will, unless you tell it not to. But to what extent they do this depends on that specific agent's system prompt, your initial prompt, the conversation context, agent files, etc.

If you were just chatting with the same model (not in an agent), it doesn't write code by default, because it's not in the system prompt.

niobe • today at 3:54 AM

But that's one of the first things you fix in your CLAUDE.md:

- "Only do what is asked."
- "Understand when being asked for information versus being asked to execute a task."

thomaslord • today at 1:15 AM

This is extra rough because Codex defaults to letting the model be MUCH more autonomous than Claude Code. The first time I tried it out, it ended up running a test suite without permission which wiped out some data I was using for local testing during development. I still haven't been able to find a straight answer on how to get Codex to prompt for everything like Claude Code does - asking Codex gets me answers that don't actually work.

stavros • yesterday at 10:55 PM

I've added an instruction: "do not implement anything unless the user approves the plan using the exact word 'approved'".

This has fixed all of this, it waits until I explicitly approve.
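
The same gate can also be enforced outside the prompt, if you have a wrapper around the agent. This is only a sketch under that assumption: `is_approved` and `next_action` are hypothetical names, and treating the match as case-insensitive is my choice, not part of the instruction quoted above.

```python
def is_approved(reply: str) -> bool:
    """True only if the reply is the single word 'approved' (case-insensitive)."""
    # Exact match after trimming whitespace: "ok", "sure", or "approved?" do not count.
    return reply.strip().lower() == "approved"


def next_action(reply: str) -> str:
    """Decide whether the wrapper lets the agent implement the plan or keeps waiting."""
    return "implement" if is_approved(reply) else "wait"
```

A strict equality check means a follow-up question about the plan can never be mistaken for a go-ahead.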

clarus • yesterday at 11:20 PM

The solution for this might be to add a ME.md in addition to AGENT.md so that it can learn and write down our character, to know if a question is implicitly a command for example.

chrysoprace • today at 1:25 AM

Maybe I should give Codex a go, because sometimes I just want to ask Claude a question and not have it scan my entire working directory and chew up 55k tokens.

hun3 • today at 3:23 AM

Does appending "/genq" work?

Or use the /btw command to ask only questions

hrimfaxi • yesterday at 10:43 PM

> Codex, on the other hand, will follow something I said pages and pages ago, and because it has a much larger context window (at least with the setup I have here at work), it's just better at following orders.

Can you speak more to that setup?

parhamn • yesterday at 10:43 PM

I added an "Ask" button to my agent UI (openade.ai) specifically because of this!

user3939382 • today at 5:11 AM

Claude Code is perfectly happy to toggle between chat and work if you’re simply clear about which you want. Capital letters aren’t necessary.

darkoob12 • yesterday at 10:50 PM

This is not Claude Code. And my experience is the opposite. For me Codex is not working at all, to the point that it's no better than asking the chat bot in the browser.

casey2 • yesterday at 11:22 PM

For the last 12 months labs have been:

1. checkpointing
2. training until model collapse
3. reverting to the checkpoint from 3 months ago
4. People have gotten used to the shitty new model

Anthropic said they "don't do any programming by hand" the last 2 years. Anthropic's API has 2 nines.

cmrdporcupine • yesterday at 11:17 PM

I'm back on Claude Code this month after a month on Codex and it's a serious downgrade.

Opus 4.6 is a jackass. It's got Dunning-Kruger and hallucinates all over the place. I had forgotten about the experience (as in the Gist above) of jamming on the escape key "no no no I never said to do that." But also I don't remember 4.5 being this bad.

But GPT 5.3 and 5.4 are a far more precise and diligent coding experience.
