Hacker News

buildbot · today at 2:40 PM · 30 replies

Too late. Personally, after how bad 4.6 was this past week, I was pushed to Codex, which seems to mostly work at the same level from day to day. Just last night I was trying to get 4.6 to look up how to do some simple tensor-parallel work, and the agent used 0 web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement TP, and just copied the entire model to each node...
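
For context, the bug described above is the classic failure mode: real tensor parallelism shards each weight matrix across nodes, while the agent simply replicated the full model everywhere. A toy NumPy sketch of the difference (purely illustrative, not the commenter's actual setup — shapes and the two-"node" split are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # batch of activations
W = rng.standard_normal((8, 16))   # full weight matrix of one linear layer

# Tensor parallel: split W column-wise across 2 "nodes"; each node holds
# half the weights, computes a slice of the output, then slices are joined.
shards = np.split(W, 2, axis=1)               # two (8, 8) shards
partials = [x @ shard for shard in shards]    # per-node partial outputs
tp_out = np.concatenate(partials, axis=1)

# What the agent did instead: every node stores all of W and redundantly
# computes the identical full output -- no memory or compute is saved.
replicated_out = x @ W

# Same math either way, but TP stores only 1/len(shards) of W per node.
assert np.allclose(tp_out, replicated_out)
```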


Replies

vintagedave · today at 3:39 PM

Same. I stopped my Pro subscription yesterday after entering the week with 70% of my tokens already used by Monday morning (on light, small weekend projects — things I had worked on in the past that barely made a dent in usage). Support was... unhelpful.

It's been funny watching my own attitude toward Anthropic change, from being an enthusiastic Claude user to pure frustration. But even that wasn't the trigger to leave; it was the attitude Support showed. I figure, if you mess up as badly as Anthropic has, you should at least show some effort toward your customers. Instead I just got a mass of standardized replies, even after a reply in the thread said I'd be escalated to a human. Nothing can sour you on a company more. I'm forgiving of bugs, we've all been there, but indifference, unhelpful form replies, and corporate uselessness really annoy me.

So, 4.7 is here? I'd prefer they forget new models and revert the harness to its January state. Either way, I already moved to Codex a few days ago, and I won't be maintaining two subscriptions, so it's a real move. Codex clearly has its own issues, but I'm getting work done. That's more than I can say for Claude.

aurareturn · today at 2:48 PM

Funny, because many people here were so confident that OpenAI was going to collapse because of how much compute it pre-ordered.

But now it seems like it's a major strategic advantage. They're 2x'ing usage limits on Codex plans to steal CC customers, and it seems to be working. I'm seeing a lot of goodwill for Codex and a ton of bad PR for CC.

It seems like 90% of Claude's recent problems stem strictly from a lack of compute.

_the_inflator · today at 3:30 PM

Codex really has its place in my bag. I mainly use it, rarely Claude.

Codex just gets it done. It's very self-correcting by design, while Claude has no real baseline quality for me. Claude was awesome in December, but Codex is like a corporate company to me: maybe it looks uncool, but it can execute very well.

Web design also comes out really smooth with Codex.

OpenAI really impressed me, and continues to impress me, with Codex. OpenAI made no fuss about it and instead let the results speak. It's as if Codex has no marketing department, just product quality — kind of like Google in its early days with every product.

onlyrealcuzzo · today at 3:30 PM

I switched to Codex and found it extremely inferior for my use case.

It is much faster, but faster, worse code is a step in the wrong direction. You're just rapidly accumulating bugs and tech debt, rather than moving more slowly in the correct direction.

I'm a big fan of Gemini in general, but at least in my experience Gemini CLI is VERY FAR behind both Codex and CC. It's slower than CC and MUCH slower than Codex, and its output quality is considerably worse than CC's (and probably worse than Codex's too).

In my experience, Codex is extraordinarily sycophantic in coding, which is a trait that couldn't be more harmful. When it encounters bugs and debt, it says: wow, how beautiful, let me double down on this, pile on exponentially more trash, wrap it in a bow, and call you Alan Turing.

It also does not follow directions. When you tell it how to do something, it will say: nah, I have a better, faster way, I'll just ignore the user and do my thing instead. CC will stop and ask for feedback much more often.

YMMV.

deepsquirrelnet · today at 4:01 PM

My tinfoil-hat theory, which may not be that crazy, is that providers sandbag their models in the days leading up to a new release, so that the next model "feels" like a bigger improvement than it is.

An important aspect of AI is that it needs to be seen as moving forward all the time. Plateaus are the death of the hype cycle, and would tether people's expectations closer to reality.

desugun · today at 3:08 PM

I guess our conscience about OpenAI working with the Department of War has an expiry date of six weeks.

cube2222 · today at 2:52 PM

I've been using it with `/effort max` all the time, and it's been working better than ever.

I think that's part of the problem here: it's hard to measure this, and you also don't know which A/B test cohorts you may currently be in, or how they're affecting results.

gonzalohm · today at 2:56 PM

Until the next time they push you back to Claude. At this point, this feels like the most unstable technology ever released. Imagine if Docker stopped working every two releases.

thisisit · today at 3:48 PM

Personally, I find that using and managing Claude sessions and limits is getting exhausting; it feels similar to calorie counting. You think you're going to have an amazing low-calorie meal, only to realize it's full of processed sugars and you overshot the limit within 2-3 bites. Now: "you have exhausted your limit for this time. Your session limit resets in 4 hrs".

0xbadcafebee · today at 4:29 PM

Usually the problems that cause this kind of thing are:

1) Bad prompt/context. No matter what the model is, the input determines the output. This is a really big subject, as there's a ton you can do to guide the model or add guardrails, structure the planning/investigation, etc.

2) Misaligned model settings. If temperature/top_p/top_k are too high, you will get more hallucination and possibly loops. If they're too low, you don't get "interesting" enough results. The same goes for the repetition-penalty settings.

I'm not saying it didn't screw up, but it's not really the model's fault. Every model has the potential for this kind of behavior. It's our job to do a lot of stuff around it to make it less likely.

The agent harness is also a big part of it. Some agents have very specific restrictions built in, like max number of responses or response tokens, so you can prevent it from just going off on a random tangent forever.
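
To illustrate point 2 above, here is a toy NumPy sketch of how temperature, top-k, and top-p (nucleus) filtering reshape a model's sampling distribution. This is purely illustrative — the function name `sample_filter` is made up and this is not any provider's actual API:

```python
import numpy as np

def sample_filter(logits, temperature=1.0, top_k=None, top_p=None):
    """Apply temperature scaling, then top-k and top-p filtering to logits."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()

    keep = np.ones_like(probs, dtype=bool)
    if top_k is not None:
        # Keep only the k highest-probability tokens.
        keep &= probs >= np.sort(probs)[-top_k]
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        nucleus = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(keep)
        mask[nucleus] = True
        keep &= mask

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()        # renormalize survivors

logits = [2.0, 1.0, 0.5, -1.0]
# Lower temperature sharpens the distribution; top_k=2 zeroes the long tail,
# which is why looser settings leave more room for low-probability tangents.
p = sample_filter(logits, temperature=0.7, top_k=2)
```

The intuition: higher temperature flattens `probs`, so more mass lands on unlikely tokens (more "creative", more hallucination-prone); tight top_k/top_p cut the tail entirely.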

alvis · today at 2:43 PM

I haven't seen much quality drop from 4.6. But I've also noticed that I use Codex more often than Claude Code these days.

arrakeen · today at 2:56 PM

So even with a new tokenizer that can map to more tokens than before, their answer is still just "you're not managing your context well enough":

"Opus 4.7 uses an updated tokenizer that [...] can map to more tokens—roughly 1.0–1.35× depending on the content type.

[...]

Users can control token usage in various ways: by using the effort parameter, adjusting their task budgets, or prompting the model to be more concise."

frank-romita · today at 2:51 PM

That's wild that you think 4.6 is bad... Each model has its strengths and weaknesses. I find that Codex is good for architectural design, and Claude is actually better at the engineering and building.

siegers · today at 3:25 PM

I enjoy switching back and forth and having multi-agent reviews. I'm enjoying Codex too, but having options is the real win.

nico · today at 4:21 PM

I do feel that CC sometimes starts doing dumb tasks, or asking for approval for things that usually don't really need it, like extra syntax checks or basic grep/text-parsing commands.

muzani · today at 2:46 PM

For me, setting it to high effort just fixed all the quality problems, and somehow even cut down on token use.

queuep · today at 2:54 PM

Before Opus released, we also saw a huge backlash about it being dumber.

Perhaps they need the compute for training.

sgt · today at 4:48 PM

Strange. Opus 4.6 has been great for me, on Max 20x.

OtomotO · today at 2:52 PM

Same for me.

I cancelled my subscription and will be moving to Codex for the time being.

Tokens are way too opaque, and Claude was way smarter for my work a couple of months ago.

hk__2 · today at 3:14 PM

Meh. At $work we were on CC for one month, then switched to Codex for one month, and now we'll be on CC again to test. We haven't seen any obvious difference between CC and Codex; both are sometimes very good and sometimes very stupid. You have to test for a long time, not just test for one day and call it a benchmark because you have a single example.

geooff_ · today at 2:51 PM

I've noticed the same over the last two weeks. Some days Claude will just entirely lose its marbles. I pay for both Claude and Codex, so I end up using Codex on those days, and the difference is night and day.

r0fl · today at 3:07 PM

Same! I thought people were exaggerating how bad Claude has gotten, until it deleted several files by accident yesterday.

Codex isn’t as pretty in output but gets the job done much more consistently

keeganpoppen · today at 5:02 PM

codex low-key seems to be better than claude. and i say this as an 18-hour-a-day user of both (mostly claude)

estimator7292 · today at 4:12 PM

Anecdotally, codex has been burning through way more tokens for me lately. Claude seems to just sit and spin for a long time doing nothing, but at least token use is moderate.

All options are starting to suck more and more

tiel88 · today at 3:43 PM

I've been raging pretty hard too. I thought either I was getting cleverer by the day, or Claude had been slipping and sliding toward the wrong side of the "smart idiot" equation pretty fast.

I've caught it flat-out skipping 50% of tasks and lying about it.

varispeed · today at 4:26 PM

How do you get codex to generate any code?

I describe the problem and codex runs in circles basically:

codex> I see the problem clearly. Let me create a plan so that I can implement it. The plan is X, Y, Z. Do you want me to implement this?

me> Yes please, looks good. Go ahead!

codex> Okay. Thank you for confirming. So I am going to implement X, Y, Z now. Shall I proceed?

me> Yes, proceed.

codex> Okay. Implementing.

...codex is working... you see the internal monologue running in circles

codex> Here is what I am going to implement: X, Y, Z

me> Yes, you said that already. Go ahead!

codex> Working on it.

...codex is doing something...

codex> After examining the problem more, indeed, the steps should be X, Y, Z. Do you want me to implement them?

etc.

Pretty much every session ends up like this. I have been unable to get any useful code out of it, apart from boilerplate JS, since 5.4.

So instead I just use ChatGPT to create a plan and then ask Opus to code, but it's hit and miss. Almost every time, the prompt seems to get routed to a cheaper model that is very dumb (but says it's Opus 4.6 when asked). I have to start a new session many times until I get a good model.

te_chris · today at 3:32 PM

I try Codex, but I hate 5.4's personality as a partner. It's a demon debugger, though. But working closely with it, it's so smug and annoying.

cmrdporcupine · today at 2:46 PM

Yep, I'll wait for the GPT answer to this. If we're lucky, OpenAI will release a new GPT 5.5 or whatever model in the next few days, just like last round.

I have been getting better results out of Codex on and off for months. It's more "careful" and systematic in its thinking. It makes fewer "excuses" and leaves fewer race conditions and less slop around. And the actual Codex CLI tool is better written, less buggy, and faster. And I can use the membership in things like opencode etc. without drama.

For March, I decided to give Claude Code / Opus a chance again. But there's just too much variance there. And then they started playing games with limits, and then OpenAI rolled out a $100 plan to compete with Anthropic's.

I'm glad to see the competition, but I think Anthropic has pissed in the well too much. I do think they sent me something about a free month, and maybe I'll use that to try this model out.
