Hacker News

data-ottawa · yesterday at 6:48 PM

I reviewed 118 conversations with Claude since March 6, all on real work projects.

Each conversation was processed to assess the level and source of frustration, then spot-checked with Gemma 4 and Claude Opus. I have a tool I use to manage my worktrees, so most work is done on branches prefixed with ad-hoc/feature/explore or similar, and the data was tagged with branch names.

43% of my Claude Code sessions (Opus 4.6, high reasoning) ended with signals of frustration. 73% of total chat time (by message count) was spent in conversations that were ultimately rated as frustrating.

Median time to frustration was 25 messages, and on average each message from Claude carries roughly a baseline 5% chance of being frustrating. The distribution of frustration over chat length actually matches this 5% baseline of IID Bernoulli trials -- which is surprising and interesting, since these events should not be IID at all.
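What "matches IID Bernoullis" means here can be sketched with a quick simulation: under a flat per-message hazard, the chance of first frustration at message k, given the session survived to k, stays near 5% regardless of k. This is a toy check, not my pipeline -- the session count, length cap, and seed are all made up; only the 5% figure comes from the data above.

```python
import random

random.seed(1)

P = 0.05          # assumed flat per-message frustration probability
N = 50_000        # number of simulated sessions (hypothetical)
MAX_LEN = 40      # hypothetical cap on session length in messages

# Message index (1-based) of the first frustrating reply per session,
# or None if the session finished clean.
firsts = []
for _ in range(N):
    hit = None
    for k in range(1, MAX_LEN + 1):
        if random.random() < P:
            hit = k
            break
    firsts.append(hit)

# Empirical hazard at message k: sessions first frustrated exactly at k,
# divided by sessions that reached k without prior frustration.
# Under IID Bernoulli trials this should hover near P at every k.
for k in (1, 10, 20, 30):
    at_risk = sum(1 for f in firsts if f is None or f >= k)
    hit_here = sum(1 for f in firsts if f == k)
    print(f"msg {k:>2}: hazard ~ {hit_here / at_risk:.3f}")
```

A flat hazard curve like this is exactly the memoryless signature: real conversations should show the hazard drifting up as context degrades, which is why the match is surprising.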

Frustration types:

- Wrong answers – 14% of sessions, 31% of frustration

- Instruction following – 11% of sessions, 25% of frustration

- Overcomplication – 8% of sessions, 18% of frustration

- Destructive actions (e.g. requesting to delete something or commit a change to prod) – 3% of sessions, 8% of frustration

- Non-responsive (service outages) – 2% of sessions

- Miscommunication – 2% of sessions

- Failed execution – 2% of sessions

Half of all frustrations happened in the first or last 20% of a chat by length. I read early frustrations as recoverable and late frustrations as terminal.
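The early/late split above is just position bucketing: divide the message index of the first frustration by session length and check whether it lands in the first or last 20%. A minimal sketch, on purely illustrative (first_frustration_msg, session_length) pairs -- not my actual data:

```python
# Hypothetical (first_frustration_msg, session_length) pairs,
# for illustration only.
sessions = [(3, 40), (38, 40), (5, 12), (11, 12), (20, 45), (44, 45), (2, 30)]

def position(frust_msg: int, length: int) -> float:
    """Relative position of the frustration within the chat, in [0, 1]."""
    return frust_msg / length

# Early = first 20% of the chat, late = last 20%.
early = sum(1 for m, n in sessions if position(m, n) <= 0.2)
late = sum(1 for m, n in sessions if position(m, n) >= 0.8)
print(f"early: {early}, late: {late}, total: {len(sessions)}")
```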

Early frustrations (sessions averaged 45 turns):

- 30% overcomplicating the problem

- 30% instruction following issues

- 30% wrong answers

- 10% destructive actions

Late frustrations (sessions averaged 12 turns -- i.e. these sessions turned terminal early):

- 36% wrong answers, with repetition

- 21% instruction following, with repeated correction from the user (me)

- 14% service interruptions/outages

- 7% failed execution

- 7% miscommunication -- Claude is unable to articulate a result or understand the problem correctly

Late frustrations also reached the highest frustration level 29% of the time.

I'm a data scientist; my most frustrating work with Claude was data cleaning/repair (a complex backfill), with 75% of those sessions marked frustrating due to overcomplication, instruction following, or destructive actions.

The best (least frustrating) workflows for DS work were code review, scoped feature work (with tickets), data validation, and config/setup and automation tasks.

Ad-hoc query work ended up in between -- these requests were generally bootstrapping queries or doing rough analysis on good data.

Side note: all of my interactions with the /buddy feature were flagged as high frustration ("furious"). Those were false positives from mock arguing with it, but they provided a neat calibration signal; those sessions were removed from the analysis entirely after classification.