Hacker News

Issue: Claude Code is unusable for complex engineering tasks with Feb updates

1316 points | by StanAngeloff | last Monday at 1:50 PM | 727 comments

Comments

wrqvrwvq | yesterday at 6:29 AM

hilarious that there are 10 billion lines of context being shuffled around and argued over, but paying a dev 100K is a techno sin. Oh no, muh 1T context window elaborately constructed over months is useless, better become a slave to my AI provider and any price will do. plz write my code for free, but with all my company's value.

data-ottawa | yesterday at 6:48 PM

I reviewed 118 conversations with Claude since March 6, all on real work projects.

Each conversation was processed to assess the level and source of frustration, and evaluated with Gemma 4 and Claude Opus for spot checking. I have a tool I use to manage my worktrees, so most work is done on branches prefixed with ad-hoc/feature/explore or similar, and the data was tagged with branch names.

43% of my Claude Code sessions (Opus 4.6, high reasoning) ended with signals of frustration. 73% of total chat time (by total messages) was spent in conversations which were eventually ranked as frustrating.

Median time to frustration was 25 messages, and on average, each message from Claude has about a baseline 5% chance of being frustrating. Frustration by chat length actually matches this 5% baseline of IID Bernoullis -- which is surprising and interesting, as this should not be IID at all.
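The IID-baseline claim above can be sanity-checked with a toy simulation (made-up sessions and a 5% hazard, not the author's actual data): under IID Bernoulli(p), time-to-first-frustration is geometric, so the simulated median should land near ceil(log 0.5 / log(1 − p)).

```python
import math
import random

random.seed(0)
P = 0.05  # assumed baseline per-message chance a message is frustrating

def simulate_session(n_messages, p=P):
    """Return the 1-based index of the first frustrating message, or None."""
    for i in range(1, n_messages + 1):
        if random.random() < p:
            return i
    return None

# Under IID Bernoulli(p): P(first frustration at message k) = (1-p)**(k-1) * p
firsts = [simulate_session(200) for _ in range(20_000)]
hits = sorted(f for f in firsts if f is not None)

# Compare the empirical median to the geometric median ceil(log(0.5)/log(1-p))
geometric_median = math.ceil(math.log(0.5) / math.log(1 - P))
empirical_median = hits[len(hits) // 2]
print(geometric_median, empirical_median)
```

If the observed per-chat-length frustration curve tracks this geometric shape, that's the "matches IID Bernoullis" observation; note a 5% hazard alone doesn't pin down a particular median time-to-frustration without knowing the session-length distribution.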

Frustration types:

- Wrong answers – 14% of sessions, 31% of frustration

- Instruction Following – 11% of sessions, 25% of frustration

- Overcomplication – 8% of sessions, 18% of frustration

- Destructive Actions (e.g. requesting to delete something or commit a change to prod) – 3% of sessions, 8% of frustration

- Non-responsive (service outages leading to non-response) – 2% of sessions

- Miscommunication – 2% of sessions

- Failed execution – 2% of sessions

Half of frustrations happened in the first or last 20% of a chat by length. I interpret early frustrations to be recoverable, late frustrations to be terminal.

Early frustrations (sessions averaged 45 turns):

- 30% overcomplicating the problem

- 30% instruction following issues

- 30% wrong answers

- 10% destructive actions

Late frustrations (sessions averaged 12 turns -- i.e. context became terminal early):

- 36% wrong answers, with repetition

- 21% instruction following, with repeated correction from the user (me)

- 14% service interruptions/outages

- 7% failed execution

- 7% communication -- Claude is unable to articulate some result, or to understand the problem correctly

Late frustrations reached the highest level of frustration 29% of the time.

I'm a data scientist -- my most frustrating work with Claude was data cleaning/repair (a complex backfill), with 75% of those sessions marked frustrating due to overcomplication, instruction following, or destructive actions.

The best (least frustrating) workflows for DS were code-review, scoped feature work (with tickets), data validation, and config/setup tasks and automation.

Ad-hoc query work ended up in between -- ad-hoc requests were generally bootstrapping queries or doing rough analysis on good data.

Side note: all of my interactions with the /buddy feature were flagged as high frustration ("furious"). Those were false positives from mock-arguing with it, but they provided a neat calibration signal. Those sessions were removed entirely from the analysis after classification.

tinyhouse | last Monday at 6:27 PM

I highly recommend everyone use Pi -- it's a simpler and better harness. The only tricky part is that, moving forward, you cannot use the Claude subscription to access Opus. But for many tasks there are enough alternatives.

slopinthebag | last Monday at 6:08 PM

This is just a placebo: people started vibe coding on empty, low-complexity repos, and as CC slops out more and more code, its ability to handle the codebase diminishes -- gradually at first, and then suddenly.

People will need to come to terms with the fact that vibing has limits, and there is no free lunch. You will pay eventually.

SilverSlash | yesterday at 9:27 AM

I'm deeply regretting paying for this service right now. There is some gaslighting going on in that issue claiming it's because of the 1M-context model. I am using the non-1M context model and it's still disastrously bad.

mrcwinn | last Monday at 4:42 PM

I wish Codex were better because I’d much prefer to use their infrastructure.

gib444 | yesterday at 7:26 AM

You will build nothing and you'll be happy.

They want a world where, to draw a comparison with food, there is one supermarket and it sells just two ingredients, so you can't cook a meal. McDonald's etc. flourish.

The lie is "supercharged ability to build whatever you want", but the reality will soon be the total opposite.

Look at how many people have zero cooking skills these days

porridgeraisin | yesterday at 5:52 AM

IMO, it's an expectations vs reality thing.

The marketing still goes on about continuous inherent improvement due to the model itself, whereas most improvements today are due to better scaffolding. The key now is to build tooling around these LLMs to make them reliably productive - whatever level that may be at.

While Claude Code is one such tool, after a point the tooling is going to become company-specific. Fortune-whatever companies directly contract OpenAI or Anthropic and have their FDEs do it for them. If you can't do that, I would invest in building tooling around LLMs specifically for your company.

Note that LLMs are approximate retrieval machines. You still need a planner* and a verifier around them. Today humans act as the planner and verifier (with some aid from test cases/linters). Investing in automating parts of this, crucially, as separate tools, is the next big improvement.

* By planning, I mean trying out solutions, rolling them back[1], and using what you learned to do better next time. The solution search process. Context management also falls under this.

[1] and no, LLMs going "wait no..." doesn't count.
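The planner/verifier split described above can be sketched as a loop. `propose` and `verify` here are hypothetical stand-ins: a real planner would prompt a model with the task plus everything learned from failed attempts, and a real verifier would run tests, linters, and type checks.

```python
def propose(task, feedback):
    # Stand-in for the LLM call: returns a candidate solution
    # conditioned on the task and prior failure notes. (Hypothetical.)
    return (task, len(feedback))

def verify(candidate):
    # Stand-in verifier: in practice, tests/linters deciding pass/fail
    # and producing actionable feedback. (Hypothetical.)
    task, attempt = candidate
    return attempt >= 2, f"attempt {attempt} needs work"

def solve(task, max_attempts=5):
    """Planner loop: propose, verify, roll back, retry with feedback."""
    feedback = []
    for _ in range(max_attempts):
        candidate = propose(task, feedback)
        ok, note = verify(candidate)
        if ok:
            return candidate      # keep the verified solution
        feedback.append(note)     # roll back, but carry what was learned
    return None                   # budget exhausted
```

The point of `feedback` is that rollbacks aren't wasted: each failed attempt narrows the next proposal, which is the "solution search" the comment describes -- as opposed to the model's in-context "wait no..." backtracking.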

kabir_daki | last Monday at 9:17 PM

"Interesting perspective. I've found Claude useful for building straightforward web tools, but agree it struggles with complex multi-file refactoring."

citizenpaul | last Monday at 6:04 PM

I think it's all a reflection of the price. To make AI/LLMs useful you have to burn A LOT of tokens -- way more than people are willing to pay for.

Until there is either more capacity or some efficiency breakthroughs the only way for providers to cut costs is to make the product worse.

desireco42 | last Monday at 5:06 PM

I've been using OpenCode and Codex and have been just fine. In Antigravity, sometimes when Gemini can't figure something out even on high, Claude can give another perspective, and this moves things along.

I think using just Claude is very limiting and detrimental to you as a technologist, since you should use this tech, tweak it, and play with it. They want to be like Apple: shut up and give us your money.

I've been using Pi as an agent and it is great; I removed a bunch of MCPs from OpenCode and now it runs way better.

Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.

As a technologist, I would love a client with a huge codebase. My approach now is to create a custom Pi agent for each specific client, and this seems to produce the optimal result, not just in token usage, but in the time we spend solving and the quality of the solution.

Get another engine as a backup; you will be happier.

zsoltkacsandi | last Monday at 4:54 PM

This has been an ongoing issue since well before February.

howmayiannoyyou | last Monday at 4:18 PM

Not just engineering. Errors, delays and limits piling up for me across API and OAuth use. Just now:

Unable to start session. The authentication server returned an error (500). You can try again.

ThrowawayR2 | last Monday at 7:04 PM

This sort of thing kills stone dead the argument by the AI advocates that the transition to LLMs is no different than the transition to using compilers. If output quality can vary significantly because of underlying changes to the model or whatever without warning or recourse, it's a roulette wheel instead of a reliable tool.

raincole | last Monday at 5:44 PM

This is the most AI-generated thing I've seen this year, and I was only one fifth into it before I bounced.

Not saying this problem doesn't exist, but if the model is so bad at complex tasks, how can we take a ticket written by it seriously? Or did the author use ChatGPT to write this? (That'd be quite ironic, admittedly.)

dorianmariecom | last Monday at 4:34 PM

codex wins :)

russli1993 | last Monday at 4:56 PM

Lol, software company execs didn't see this coming. Fire all your experienced devs to jump on the Anthropic bandwagon. Then Anthropic dumbs down its AIs and you have no one on your team who knows or understands how things are built. Your entire company goes down. Your entire company's operation depends on the whims of Anthropic. If Anthropic raises prices by 10% per year, you have to eat it. This is what you get when you don't respect human beings and human talent.

adonese | last Monday at 4:27 PM

Things have gone downhill since they removed ultrathink /s

ianberdin | last Monday at 8:50 PM

I use it ultra extensively and it works absolutely fantastically. Sometimes I think "people are right, it is worse now", and then realize it's my mistake: poor context or a poor prompt. Garbage in, garbage out. No, it doesn't work worse -- it works better.

I built an entire AI website builder, https://playcode.io, using it, alone. 700K LOC total. It also uses Opus. So believe me, I know how it works. The trick is simple: never ever expect it to find the necessary files. Always provide them yourself. Always.

So I think what you wanted to say is a huge thank you for this opportunity to get working code without writing it. Insane times, insane.

Huge thanks for the 1M context window included in the Max subscription.
