Aurornis • today at 3:25 PM • 8 replies

> We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code.

> Couldn’t tell.

Why would you expect them to be able to recognize the signature of a model from a pair of PRs? I don’t understand why you think this is a useful test for anything when we have numerous benchmarks that run hundreds of tests on models, and both GPT-5.5 and Opus-4.8 perform similarly on them.

I have subscriptions to both. I run both on max reasoning. It is interesting to see the relative strengths and weaknesses of each model. You won’t always see it if you’re just scanning code. Sometimes one will spin for a long time on certain problems where the other has no problem finding the appropriate parts of the codebase and getting an efficient solution.

antirez made a comment that he and others found GPT-5.5 to be better than Opus at the optimization tasks he was working on. There are other classes of tasks where GPT-5.5 consistently stumbles and Opus gets to a solution more quickly. Lately I’ve been working on some code where neither model comes up with a good solution. That’s just how LLMs go.

The only reason you have seen more activity about Claude is that they got there first. Codex has been a step behind and GPT couldn’t match Opus at first. You’re testing them after they’ve closed the gap.

Replies

vunderba • today at 4:39 PM

Yup, OP is conflating so many things that the comparison has all the scientific rigor of the Pepsi Challenge.

For a developer using an LLM on a daily basis, the experience is about much more than just the resultant code.

There’s everything from:

- how often you had to manually steer the model

- how frequently you needed to course-correct

- how much detail you had to provide up front

- how the interaction felt (sycophantic, etc.)

- how well it handled MCP and external tooling

- how effectively it could pull in additional information from external sources such as the web

- how fast it produced code

- how much it cost

Many of my friends who are devs use things like OpenCode CLI with OpenRouter because they switch between the various SOTA models so often. Seeing a Claude "meetup" doesn't prove anything other than that somebody chose the name because it resonated more than "Generic LLM Meetup".

Wowfunhappy • today at 4:37 PM

Kind of orthogonal to the discussion, but could you broadly describe the code you're working on that both models are bad at? One thing I'm still struggling with is figuring out what types of code LLMs can vs cannot write.

kaydub • today at 5:38 PM

I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

I flip between models all the time. Makes little difference. Sometimes one model is faster or better than another but there's no rhyme or reason why.

ryandrake • today at 4:42 PM

I think the subscription pricing model kind of incentivizes developers (at least hobby developers) to pick one and go all in on it. For someone who has probably never paid $20/mo for a piece of software in their life, $20/mo is kind of a big commitment, and the pay-per-token schemes are reportedly much more expensive for the equivalent blob of coding they enable. So you "pick one," plonk down the $20, and use it as much as you can in the month so it's worth it. If you want to try the other one, you don't renew next month, and plonk down another $20 for the other one.

You can go back and forth and compare since you pay for both subscriptions, but is that a usual case? I'd guess most developers picked one in 2025 and haven't gone back. Just like most people just pick a bank for their checking account and never change it.

riedel • today at 5:26 PM

Actually it would be fun to try to test the developer personality of the models.

Actually there is a nice body of work by Steven Clarke on cognitive dimensions of notations/APIs and the interaction with developer personalities.

I wonder if the same holds for AI models and harnesses.

amazingamazing • today at 3:41 PM

I am not sure why the past matters here. I am talking about now; it is a fast-moving space.

As for the test, of course the output matters. Take image models for example. Differences are clear as day.

Should the fact that OpenAI existed before Anthropic matter at all? No, imo. I would have used Opus 4.8, but it only just came out. It's a fast-moving space.

osigurdson • today at 3:44 PM

Exactly. Popular opinion is behind reality by several months. Claude used to be significantly better; now it is basically the same.

fmbb • today at 5:38 PM

> Sometimes one will spin for a long time on certain problems where the other has no problem finding the appropriate parts of the codebase and getting an efficient solution.

Surely this is just due to the random nature of these stochastic parrots?

Do you mean you have identified a class of problems Claude always stalls on and another class of problems Codex always stalls on? What identifies these different classes of problems you see? How would you say Claude is stronger than Codex and vice versa? Why?
