Hacker News

amazingamazing today at 2:09 PM, 77 replies

I never want to hear from developers again that they are not susceptible to marketing. I see meet ups specifically about Claude often.

Modern tupperware party.

A colleague was convinced Claude is better so we played a game. We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code.

Couldn’t tell.

Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down.


Replies

Aurornis today at 3:25 PM

> We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code.

> Couldn’t tell.

Why would you expect them to be able to recognize the signature of a model from a pair of PRs? I don’t understand why you think this is a useful test for anything when we have numerous benchmarks that run 100s of tests on models and both GPT-5.5 and Opus-4.8 perform similarly.

I have subscriptions to both. I run both on max reasoning. It is interesting to see the relative strengths and weaknesses of each model. You won’t always see it if you’re just scanning code. Sometimes one will spin for a long time on certain problems where the other has no problem finding the appropriate parts of the codebase and getting an efficient solution.

antirez made a comment that he and others found GPT-5.5 to be better at the optimization tasks he was working on than Opus. There are other classes of tasks where GPT-5.5 consistently stumbles where Opus will get a solution quicker. Lately I’ve been working on some code where neither model comes up with a good solution. That’s just how LLMs go.

The only reason you have seen more activity about Claude is that they got there first. Codex has been a step behind and GPT couldn’t match Opus at first. You’re testing them after they’ve closed the gap.

epistasis today at 2:50 PM

Calling this a "Tupperware party" seems a bit emotional. You're intentionally disregarding many things that matter for devs in order to try to claim equivalence, rather than paying attention to the actual process of software creation.

For example in your "test" you're only looking at output and ignoring the entire process of creation.

In addition to that process, you're ignoring that Claude Code was first and better for a long time, why would people switch for something that produces the same output? Claude Code has been way ahead in the process of agentic software creation for a long time, I still prefer its features. Even though I think that Opus 4.7 was a big step backwards, and I've been getting worse results seemingly every day with the churn of features at Claude Code, some of that may also be me testing the bounds of how little I can specify and still get acceptable results, so it's hard to know.

Calling all these concrete realities "marketing" is itself you trying to market Codex as "good enough" instead of paying attention to how we got where we are and where we will go in the future.

afavour today at 3:18 PM

You're overestimating the extent to which individual developers have a choice here. My employer signed up for a Claude Code membership, I use Claude Code. I cannot use Codex.

Anecdotally I hear of folks with workplace Claude Code subscriptions all the time. I'm not sure I've ever heard someone talk about their workplace Codex subscription. Anthropic clearly did a far better job chasing corporate customers while OpenAI was busy chasing consumers with Sora etc.

jnovek today at 2:22 PM

I can’t tell the difference between code written in vim or vs code but it matters substantially to the person writing the code. There’s stuff beyond just the output that goes into tool choice.

utopiah today at 2:30 PM

Ah that's always SO fun. It doesn't matter how "smart" the person actually is (or thinks they are): we are ALL susceptible to influence, and blind tests are shockingly simple to implement.

Convinced you can distinguish A from B? Ok! No problem, let's try! Can be at the dinner table for fancy wine or with agents, it's all the same, you try an option, another option, maybe all options from the same, and if you reliably can't tell well kudos, you are just like the rest of us!

It's easy to "know" in retrospect but blind test is where genuine difference can be found. Or not.
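
For anyone who wants to put a number on "reliably", here is a minimal sketch of the arithmetic. It is my own illustration, not something from the test described above: the function name `chance_of_at_least` and the 50% guess rate (two options per trial) are assumptions. It computes how likely a given score is under pure guessing.

```python
from math import comb


def chance_of_at_least(correct: int, trials: int, p_guess: float = 0.5) -> float:
    """Probability of scoring `correct` or more out of `trials` by guessing alone.

    Hypothetical helper: p_guess is the per-trial chance of a lucky guess
    (0.5 when picking between two models).
    """
    # Upper tail of the binomial distribution: sum P(X = k) for k >= correct.
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )


if __name__ == "__main__":
    print(chance_of_at_least(2, 2))   # both of two PRs labelled correctly
    print(chance_of_at_least(9, 10))  # nine of ten labelled correctly
```

With only two samples, getting both right by luck happens 25% of the time, so a pair proves little in either direction. Nine out of ten by luck is roughly 1%, which is where a blind test starts to say something.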

brookst today at 2:17 PM

This is like saying you gave a Taylor Swift fan sheet music from 1984 and from Michael Jackson’s Thriller and they couldn’t tell the difference.

I have a strong affinity for Claude Code because of the interaction experience and overall tone / vibe / process. I am 100% willing to believe the code it produces is identical or possibly less good than Codex.

I enjoy working with Claude in a way I just don’t get from OpenAI. YMMV, you may feel just the opposite. But it’s a mistake to look at the produced code as the only dimension of these products.

bilekas today at 2:14 PM

I think for developers the distinction is that ChatGPT is this commercial all-in-one solution for normies and Claude is specifically for developers. In reality, as you say, the results for normal developers are indistinguishable.

Frost1x today at 3:27 PM

The results are the same but I’ve found the process of getting to the results is just more pleasant with Claude. I can’t put my finger on it. Overall most of these models at the highest level are about the same in many respects, but the UI/UX for some is just more enjoyable, for lack of a better term.

Codex I feel the need to be very specific and precise with. Claude… I feel like I can be lazy, which I enjoy.

Both still need to be reviewed stringently but I feel I can be more ambiguous with Claude and get better results than with Codex.

sebzim4500 today at 2:36 PM

I don't think it's marketing, for quite a long time Claude was clearly better and not everyone has adapted to the new reality where they have similar capabilities.

jesse_dot_id today at 5:23 PM

It's a matter of what context is available to me at this time. I like LLMs. They improve my workflow to an insane degree. I think Sam Altman kind of sucks. I don't trust OpenAI. If they were the only kid on the block, I'd use Codex. It's entirely possible Anthropic sucks in the exact ways that OpenAI sucks but has better PR. I don't have time to deep dive to find out. I still like using LLMs. I started using Claude because Cursor, as a company, did something that I can't recall but gave me the ick. So I switched to Claude Code.

I still use Claude Code because I have the most experience with it now, and it's the harness that I understand on a granular level. If something comes along that is clearly better, or if it becomes clear that Codex is miles ahead, I'll try it and evaluate it. To your point, there doesn't seem to be much of a difference.

Arguing over this stuff feels kind of silly, like back in the day when my friends would give me shit for using mIRC instead of ircii or BitchX. I liked the GUI then because I did. I like Claude Code now because I do.

duxup today at 6:10 PM

I certainly can’t tell.

I honestly think I’d need weeks of all workday testing to even form an opinion… and some in depth training before that to use each given tool right…

And then … I might decide I can’t tell the difference.

As it is I use Claude and I don’t have the time to properly compare.

AnotherGoodName today at 2:53 PM

I don't think that's the only reason but you're spot on about OpenAI marketing being absolutely terrible. The primary product names of "Claude" vs "ChatGPT" highlight this remarkable difference. To the point where I'm seeing Claude completely take over the generic term for agent.

I do think OpenAI is doomed due to bad leadership. What you said (that the marketing is relatively terrible) and what others are saying here (that the product is worse) is damning isn't it? Are they really failing on all fronts?

comboy today at 3:05 PM

1. It's 1 in 10 failures that can take half of your time or bugs that can take a long time to surface. Plus the way they change things largely depends on the current codebase (and how it was created)

2. In my case Codex seems to be writing more solid code, but I still use Claude most of the time because it's my witty rubber ducky and I can actually sometimes force some legit insights out of it. Codex is much worse at this. And whether that matters or not depends on the project.

yoyohello13 today at 3:14 PM

I picked Anthropic way early on, before Claude Code even existed. Because they at least pay lip service to behaving morally. That’s the most you can hope for these days really.

bloggie today at 5:38 PM

Steam and other game stores are pretty much the same but Steam is more popular because every one of their competitors has decided to continually shoot themselves in the foot over and over.

Even if Claude and ChatGPT were exactly the same, Claude would be more popular because OpenAI has decided to make some very unpopular moves and try to make money where popularity isn't required. At the moment that popularity still seems to matter.

kaydub today at 5:37 PM

I've always interchangeably used the models.

I don't look at benchmarks.

It's a non-deterministic tool. A lot of the shit going on with LLMs just doesn't make sense to me. All the tooling around like MCPs, they're all just putting stuff into context. So to me the tools aren't really robust and they make little difference.

Lots of AI psychosis going on these days. And I say that as somebody that hasn't written a line of code since Sept 2025

regluous today at 2:13 PM

Everyone can be propagandised. It's a matter of pushing the right buttons.

onesingleblast today at 5:51 PM

Newer GPT (5+) models seem to forget imports more often than Claude and use all lowercase comments more (possibly as part of OpenAI's effort to make it more concise).

It also seems to use modern Java features like var and records more.

Hippocrates today at 5:13 PM

The harness/UI that Claude Code brought was the thing that stole developer mindshare. That's when people stopped coding in IDEs. Nothing to do with the underlying model.

pyrale today at 4:31 PM

> I never want to hear from developers again that they are not susceptible to marketing.

Did you need to come to that conclusion?

Marketing has always been a significant part of new technology adoption. Whether it's for cloud adoption, for new programming languages, for new software development techniques, etc...

jrnichols today at 5:40 PM

The funny thing about Tupperware is that some of us have their products from many many years ago and they still work great.

I think we've had the same iced tea pitcher since I was 5 years old, for example. Solid.

Will we be able to say the same thing about Claude?

shepherdjerred today at 5:47 PM

> I never want to hear from developers again that they are not susceptible to marketing.

It’s a really good signal of self-awareness/arrogance

jjice today at 4:58 PM

I found that the newest opus and 5.5 are definitely close enough where most of the work I do could be done with either. I've seen small differences in planning which I feel like Claude does do better, but I think both products are close enough where I wouldn't be upset if one disappeared.

mgrunwald_ today at 3:00 PM

I don't think it's only marketing. OpenAI had the advantage of being first to the market, and in the beginning of the race it seemed that the future belongs to them. Then came the bad PR and unpredictable quality of their main product.

For general use, ChatGPT's answers have gotten worse over the last year. I abandoned it.

scosman today at 3:41 PM

Benchmarking 1 or a few samples isn't ever going to yield anything but noise. The actual benchmarks use thousands of tasks.

GPT 5.5 genuinely was back on top for a while there, but if you look at the past 2 years, being on Claude was better than being on OpenAI most of the time. If you're going to pick a tool and not switch constantly it was the right choice. Not to mention their tooling has always been ahead, and that gets ecosystem benefits.

Are they close and interchangeable today? Sure. But Sonnet was genuinely way better than anything OpenAI offered for a long time -- the valuation reflects that, not any given moment in time.

holistio today at 2:17 PM

Been to an Anthropic event in Paris last summer.

They served caviar. It probably had good ROI.

pflenker today at 3:35 PM

You confuse ease of using a tool with quality of output. A skilled carpenter can work both with high and with medium quality tools and prefer one over the other with no difference visible in the craft they produce.

__MatrixMan__ today at 4:40 PM

It seems we're moving past the point where it's all about model capability. opus4.7 behaves better for me than gpt5.5 because I'm familiar with its idiosyncrasies. Sounds like you've got a good balance between them.

At the end of the day what matters is which team is better, not which model. If Anthropic continues to feel like the good guy, relatively speaking, then people are gonna choose to spend more time getting to know its products and less time with OpenAI's, and on average Anthropic's will be the more capable teams.

I think vibes are gonna matter more and more going forward. The potential for bad behavior on the part of an AI company is severe. We're gonna have to tolerate whoever we enable in this space, so I propose that we make their marketing teams work as hard as possible to show us which will supply better vibes.

PeterStuer today at 5:34 PM

So you black-boxed a few 'success' tests, while the main difference between the two is the way they get to the result?

isityettime today at 3:08 PM

> i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down.

This is complicated by the way that the coding agents inject prompts that preempt and potentially undermine user instructions. I suspect that one of the reasons Codex works way better for me than Claude Code in certain projects is that the latter adds some garbage like "go ahead and write repetitive copy/paste code, keep it simple, take shortcuts" to every session. A fair test would have to hide but more or less still use the harnesses, not just the models.

christophilus today at 3:08 PM

I find codex superior in speed and equal in quality, so it’s my preference. But Claude Code made prettier UIs last time I tested. Codex produces Microsoft-grade UIs. Very enterprise and ugly unless I actively steer it.

bwfan123 today at 3:53 PM

> Couldn’t tell.

Add DeepSeek v4 to it, and it will be close at 1/10th the price. I use all three (Codex, Claude, and DeepSeek) and they are close.

jjcm today at 3:08 PM

Very similar thing happened when I was at a design event a couple of days ago. I’d say it’s even worse on the design end - there was a big discussion around how to optimize your usage of Claude. Not optimize your usage of AI, but Claude specifically, as it was the only model literally all of them were using. The biggest issue is they were all hitting their usage limits. I asked whether they had tried other, lighter models (e.g. Gemini or Composer), and it was like I was speaking a foreign language.

dawnerd today at 2:39 PM

Pretty easy to tell depending on what the code is. GPT follows this pattern of using maybe_something names and uppercase constants by default. Claude is a little more natural but tends to include more fallbacks than gpt5.5.

mewpmewp2 today at 2:23 PM

I use both, enough to reach Codex highest personal sub limits and Claude is stronger to me specifically because of how the flow of building feels. So the PR for any random task would be irrelevant to me.

tedivm today at 4:58 PM

So you both used Anthropic models (Opus 4.7 being from Anthropic)? I'm struggling to understand what your comparison really was here.

_345 today at 3:56 PM

Agree wholeheartedly. I think that Anthropic has just invested more effort in creating a better DevEx than OpenAI, and so people just "feel" that claude code is better but they're about the same really, claude code might be 5% better at best.

vr46 today at 2:50 PM

a) everyone is "susceptible" to marketing - so what

b) therefore a preference for Claude is marketing - complete bollocks

Either the tasks you chose were well below the capabilities of top models, or meaningful differences for preference are elsewhere, or both.

Your comment is probably energy-efficient and sustainable, however, because you could use it again and again when another comparison comes up, like Vim vs Emacs, or tea vs coffee

unshavedyak today at 3:17 PM

> Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down.

I think you're missing one (or more) of the facets individuals decide "better" is, for the subjective individual.

Early on i hopped between all the providers. Code quality for SOTA at the time was pretty decent if you didn't ask it to solve challenging problems. However the thing i found most difficult is consistency in how it listened. Eg Gemini (i forget what version, not current) was super prone to focusing solely on the functionality/goal, but not any of the directions on how to write the code. It would throw in comments everywhere, document in a manner i didn't want, use abstractions i told it not to, etc.

How well a model would follow instructions to drop their horrible "isms" was the #1 criterion for me. If i have to constantly remind the model not to do X behavior then it's a terrible model.

With that said, that is why i chose Claude for the last N months. However i've stuck with Claude because dealing with these "isms" and their little behavioral nuances is a chore in itself. I've found you have to learn the model just as much as anything, and so the idea of hopping these days when i'm just trying to get shit done is not likely.

These days for me personally, Claude has to give me a reason to switch rather than me investing even more money (i'm on the 20x plan) in other providers. I'm definitely not committed to Claude Code, but i am tired of the LLM churn, tooling churn, subscription churn, and the general fear of which providers we can trust.

edit: In short, it's the interactive UX just as much as it is the final output.

andsoitis today at 3:37 PM

Instead of only having them evaluate the final output, you ought to also have a way to have them evaluate the process and agentic aspects in getting to said output. Claude Code outshines when you look at it end-to-end, in my experience.

melenaboija today at 2:20 PM

Yes, which means that in the long run this looks ugly.

So much faith and money in this idea, and seeing how fragile it is, does not look good.

theptip today at 4:00 PM

Honestly I have no idea how you couldn’t tell. Reading a PR I can see the difference without even reading the words. (I doubt I could spot the difference just looking at the code diffs though.)

Claude commit messages - well structured test plan, readable.

Codex commit messages - wall of text, no structure.

The big difference though is sitting with the tools and using them for work. These are for sure vibes, but I’m sure you could pull out metrics for # steering re-prompts for example.

Codex just goes off and solves the problem, usually comes back with a solve; Claude more often gives up or needs input. Opus gives a broader design discussion, better at conversation. Codex finds deeper/better edge cases.

I think it’s like Emacs vs Vim - you can get your work done with both. There may be some tasks where one is way stronger. A strict “Better” is quite hard to justify.

Ultimately tool choice is a mix of science and art/taste; I want to feel joy using my tools, and fun little pixel explosions make me happy. If a different tool makes you happy, that is also fine.

illwrks today at 2:41 PM

Modern Tupperware party. 100% agree! That’s the best framing I’ve heard in a long time!

vjvjvjvjghv today at 3:08 PM

The results may be the same but I personally find Claude nicer to work with. It seems to understand my intent better than GPT and needs less guidance. Maybe it’s just personal preference.

wongarsu today at 2:27 PM

Claude was the best for the longest time. GPT5.5 challenges that, but inertia is real

rjh29 today at 2:26 PM

It's crazy hearing devs on this site claim Claude is 10x better than all other AI solutions. I think it is fomo. Claude $LATEST_VERSION is perceived as the best and anything else is "missing out". New version comes out? Suddenly the old version is worthless, how on earth did anyone get work done with that?

Same reason people buy the RTX 4090 and 5090 cards - overpriced but they must have the "best". Never mind the diminishing returns trying to max out PC settings (3-4x performance hit for an almost imperceptible increase in graphics, ignoring DLSS) - it's the psychological cost of having to move a slider down a notch.

I've been using Google and now DeepSeek v4 and I am having absolutely no problems and it's a fraction of the cost. I'd love for Claude to be 10x better but it just isn't, for my use case anyway.

logdahl today at 3:59 PM

A lot is changing. Like 9 months ago, I was convinced Claude was best. I'm not so sure anymore :^)
