Cool. Please check back in with us after they’ve raised the price 50x and you can no longer build anything because you are alienated from your tools.
It worries me that the best models, the ones that can one-shot apps and such, are all non-free and owned by companies who can't be trusted to have end-users' best interests at heart. It would be greatly reassuring to see a self-hostable model that can compete with Opus 4.5 and Gemini 3 at such coding tasks.
What about Sonnet 4.5? I used both Opus and Sonnet on Claude.ai and found Sonnet much better at following instructions and doing exactly what was asked.
(it was for a single HTML/JS PWA to measure and track heart rate)
Opus seems to go less deep, does its own thing, and doesn't follow instructions exactly, EVEN IF I WRITE IN ALL CAPS. With Sonnet 4.5 I can understand everything the author is saying. Maybe Opus is optimised for Claude Code and Sonnet works best on the web.
Claude Code is very good; good enough that I upgraded to the Max plan this week. However, it has a long way to go. It's great at one-shotting (with iterations) most ideas, but it doesn't do as well when the task is complicated in an existing codebase. This weekend I migrated the backend for the SaaS I am building from Python to .NET Core. It did the migration but completely missed the conventions the frontend was using to call the backend. While the conversion itself went OK, every user journey was broken. I am still manually testing every code path and feeding the errors back in to get Claude to fix them. My instructions were fairly comprehensive, but Claude still missed most of it. My fault that I didn't generate tests first; after this migration, that's my first task.
This resonates with my experience with Codex 5.2, at least directionally. I'm pretty persnickety about code itself, so I'm not at the point where I'll just let it rip. But in the last month or two things have gone from "I'll ask on the web interface and maybe copy some code into the project" to trusting the agent and getting a reasonable starting point about half the time.
> because models like to write code WAY more than they like to delete it
Yeah, this is the big one. I haven't figured it out either. New or changing requirements are almost always implemented as a flurry of if/else branches all over the place, rather than by stepping back and reimagining a cohesive integration of old and new. I've had occasional luck asking for this explicitly, but far more frequently they'll respond with recommendations that are mechanical ("you could extract a function for these two lines of code that you repeat twice") rather than architectural in nature. (I still find pasting a bunch of files into the chat interface and iterating on refinements conversationally to be faster and to produce better results.)
That said, I'm convinced now that it'll get there sooner or later. At that point, I really don't know what purpose SWEs will serve. For a while we might serve as go-betweens for the coding agent and PMs, but LLMs are already way better at translating from tech jargon to human, so I can't imagine it would be long before product starts bypassing us and talking directly to the agents, who (err, which) can respond with various design alternatives, the pros and cons of each, all the dependencies, possible compatibility concerns, alignment with future direction, migration time, compute cost, user education and adoption tracking, etc., all in real time in fluent PM-ese. IDK what value I add to that equation.
For the last year or so I figured we'd probably hit a wall before AI got to that point, but over the last month or so, I'm convinced it's only a matter of time.
Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
I'm currently finishing the Mistborn series, so please do not read further unless you want a spoiler. SPOILER
There is a suspicion that mists can change written text. END OF SPOILER
So how can we be sure that Haiku didn't change the text in favour of AI, then?
Weird title. Obviously, early AI agents were clumsy, and we should expect more mature performance in the future.
Leopold Aschenbrenner was talking about "unhobbling" as an ongoing process. That's what we are seeing here. Not unexpected
A lot of the complaints about these tools seem to revolve around their current inability to innovate on greenfield or overly complex tasks. I would agree with this assessment of their current state, but the sentiment of "I will only use AI coding tools when they can do 100% of my job" seems short-sighted.
The fact of the matter, in my experience, is that most of the day to day software tasks done by an individual developer are not greenfield, complex tasks. They're boring data-slinging or protocol wrangling. This sort of thing has been done a thousand times by developers everywhere, and frankly there's really no need to do the vast majority of this work again when the AIs have all been trained on this very data.
I have had great success using AIs as vast collections of lego blocks. I don't "vibe code", I "lego code": telling the AI the general shape and letting it assemble the pieces. Does it build garbage sometimes? Sure, but who doesn't from time to time? I'm experienced enough to notice the garbage smell and take corrective action, or toss it and try again. Could there be strange crevices in a lego-coded application that the AI doesn't quite have a piece for? Absolutely! Write that bit yourself and then get on with your day.
If the only thing you use these tools for is doing simple grunt-work tasks, they're still useful, and dismissing them is, in my opinion, a mistake.
It’s incredibly tiring to see this narrative peddled every damn day. I use Opus 4.5 every day. It’s not much different from previous models; it still does dumb things all the time.
Yowza, AIs excel at writing low performance CRUD apps, REVOLUTION INCOMING
The main issue in this discussion is the word "replace". People will come up with a bunch of examples where humans are still needed in SWE and can't be fully replaced, and that is true. I think claiming that 100% of engineers will be replaced in 2026 is ridiculous. But how about downsizing? Yeah, that's quite probable.
I pivoted into integrations in 2022. My day-to-day now is mostly in learning the undocumented quirks of other systems. I turn those into requirements, which I feed to the model du jour via GitHub Copilot Agents. Copilot creates PRs for me to review. I'd say it gets them right the vast majority of the time now.
Example: One of my customers (which I got by Reddit posts, cold calls, having a website, and eventually word of mouth) wanted to do something novel with a vendor in my niche. AI doesn't know how to build it because there's no documentation for the interfaces we needed to use.
>Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
Either it wasn’t that good, or the author failed in the one phrase they didn’t proofread.
(No judgement meant, it’s just funny).
Title: Ask HN: How do you evaluate claims of “this model changes everything” in practice?
Every big model release seems to carry the identical vibe: finally, this one crossed the line. The greatest programmer yet. The end of workflows as we know them.
I’ve learned to slow myself down and ask a different question. What has changed in my day-to-day work after two weeks?
The filter I currently apply is rough:
Did it really solve a problem, or did it just make the easy parts easier?
Has it reduced the number of decisions I make, or has it created new ones?
Have my review responsibilities decreased or increased?
Some things feel revolutionary on day one and then quietly fade into nice-to-haves. Others barely wow at first, but stick around.
For those who've been through a couple of these cycles:
What indicators suggest that an upcoming release will be significant?
When do you change your workflow, and how long do you wait?
Honestly, I don’t understand the universal praise for Opus 4.5. It’s good, but really no better than other agents.
Just today:
Opus 4.5 Extended Thinking designed a Postgres schema for “stream updates after snapshot” with bugs in it.
Grok Heavy gave a correct solution without explanation.
ChatGPT 5.2 Pro gave a correct solution and also explained why the simpler way wouldn’t work.
Yea, my issue with Opus 4.5 is it's the first model that's good enough that I'm starting to feel myself slip into laziness. I catch myself reviewing its output less rigorously than I did with previous AI coding assistants.
As a side project / experiment, I designed a language spec and am using (mostly) Opus 4.5 to write a transpiler for it (the language transpiles to C). The parser was no problem (I used s-expressions for a reason). The type checker and the transpiler itself have been a slog - I think I'm finding the limits of Opus :D. It particularly struggles with multi-module support. Though some of this is probably down to mistakes I made while playing architect and iterating with Claude - I haven't written a compiler since my senior-year compiler design course 20+ years ago. Someone who does this for a living would probably have an easier time of it.
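(Tangent for anyone curious why s-expressions make the parser the easy part: the whole grammar is atoms plus parenthesized lists, so a tokenizer and one recursive function cover it. A minimal illustrative sketch in TypeScript - not the parent's actual code, and all the names are invented:)

```typescript
// Why s-expression parsing is "no problem": pad the parens, split on
// whitespace, and recurse. That's the entire front end.
type SExpr = string | SExpr[];

function tokenize(src: string): string[] {
  return src
    .replaceAll("(", " ( ")
    .replaceAll(")", " ) ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

function parse(tokens: string[]): SExpr {
  const tok = tokens.shift();
  if (tok === undefined) throw new Error("unexpected end of input");
  if (tok === "(") {
    const list: SExpr[] = [];
    while (tokens[0] !== ")") {
      if (tokens.length === 0) throw new Error("missing ')'");
      list.push(parse(tokens));
    }
    tokens.shift(); // consume the closing ")"
    return list;
  }
  if (tok === ")") throw new Error("unexpected ')'");
  return tok; // atom
}

// (define (square x) (* x x)) -> ["define", ["square", "x"], ["*", "x", "x"]]
console.log(parse(tokenize("(define (square x) (* x x))")));
```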
But for the CRUD stuff my day job has me doing? Pffttt... it's great.
These are very simple utilities. I expect AI to be able to build them easily. Maybe in a few years it will be able to write a complete photo editor or CAD application from first principles.
most of software engineering was rational; now it is becoming empirical.
it is quite strange: you have to make it write the code in a way it can reason about without reading all of it, and you also have to feel the code without reading all of it. like a blind man feeling the shape of an object; Shape from Darkness.
you can ask opus to make a car, and it will give you a car. then you ask it for navigation; no problem, it uses google maps, works perfectly.
then you ask it to improve the brakes, and it will give internet to the tires and the brake pedal, and the pedal will send a signal via ipv6 to the tires, which will enable a very well designed local braking system. why not, we already have internet for google maps.
i think the new software engineering is 10 times harder than the old one :)
IMO Codex produces working code slowly, while Opus produces superficially working code quickly. I like using Opus to drive Codex sessions and checking its output. Clawdbot is really good at that, but a long-running Claude Code session with Codex as sub-agents should work well too.
The above is for vibe coding; for taking the wheel, I can only use Opus, because I suck at prompting Codex (it needs very specific instructions), and Codex is also way too slow for pair programming.
To those of you who use it: How much does Claude Code cost you a month on avg?
I only use VS Code with a Copilot subscription ($10) and already get quite a lot out of it.
My experience is that Claude Code really drains your pocket extremely fast.
See also: a post from a couple days ago which came to the same conclusion that Opus 4.5 is an inflection point above Sonnet 4.5 despite that conclusion being counterintuitive: https://news.ycombinator.com/item?id=46495539
It's hard to say if Opus 4.5 itself will change everything given the cost/latency issues, but now that all the labs will have very good synthetic agentic data thanks to Opus 4.5, I will be very interested to see what this year's LLM releases will be able to do. A Sonnet 4.7 that can do agentic coding as well as Opus 4.5 but at Sonnet's speed/price would be the real gamechanger: with Claude Code on the $20/mo plan, you can barely do more than one or two prompts with Opus 4.5 per session.
I had a feeling similar to the one expressed in the title, but regarding ChatGPT 5.2.
I haven't tried it for coding. I'm just talking about regular chatting.
It's doing something different from prior models. It seems like it can maintain structural coherence even for very long chats.
Whereas prior models felt like System 1 thinking, ChatGPT 5.2 appears to exhibit System 2 thinking.
As impressive as Opus 4.5 is, it still fails in situations like this one: it assumes 0-indexing while the component it's supposed to work with assumes 1-indexing. It has access to that information on disk, but just forgets to look into it.
Opus 4.5 is incredible; it is the GPT-4 moment for coding because of how honest and noticeable the capability increase is. But it still has blind spots, just like humans.
Just an open thought: what if most of the improvement we are seeing is due not to LLM improvements but to context management and better prompting?
Ofc the reality is a mix of both, but I'm really curious which contributes more.
Probably just using Cursor with old models (eww) could yield a quick answer.
To the author: you wrote those apps. Not like you used to, but you wrote them.
IMO, our jobs are safe. It's our ways of working that are changing. Rapidly.
OK, if it's almighty, then why aren't the benchmarks at 100%? If you look at the individual issues, they are somewhat small and trivial changes in existing codebases.
(Note that if you look at individual slices, Opus is often outperformed by Sonnet.)
Once you get your setup bulletproof enough that you can have multiple agents running at the same time, each able to run unit tests and close their own loops, things get even faster, however you accomplish that. It's not as easy as it sounds, mostly (and absurdly) due to port collisions. E2E testing with Playwright is another leap.
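(One way around the port-collision problem - a minimal sketch, not necessarily the parent's setup, and the env var names are illustrative - is to stop hardcoding ports and let the OS assign an ephemeral one per agent sandbox:)

```typescript
// Avoid port collisions between parallel agent sandboxes by letting the
// OS pick a free ephemeral port instead of hardcoding one.
import { createServer } from "node:net";

// Listen on port 0; the OS assigns a free port. Close the probe server
// and hand the number to the test run.
function getFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const srv = createServer();
    srv.once("error", reject);
    srv.listen(0, () => {
      const addr = srv.address();
      if (addr === null || typeof addr === "string") {
        reject(new Error("no port assigned"));
        return;
      }
      srv.close(() => resolve(addr.port));
    });
  });
}

async function main() {
  const port = await getFreePort();
  // Illustrative env vars: point the dev server and Playwright's baseURL
  // at this agent's private port.
  process.env.PORT = String(port);
  process.env.PLAYWRIGHT_BASE_URL = `http://localhost:${port}`;
  console.log(`agent sandbox using port ${port}`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

(There is a small window between closing the probe and the real server binding, so it's not airtight, but in practice it beats hardcoded ports for parallel runs.)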
I'm always surprised to see no comments in these discussions from people who just like coding, learning, solving problems… I mean, it's amazing that LLMs can build an image converter or whatever you dream of, in a language you don't know, in a field you're not familiar with, in an hour, for 30 cents… I'm sure your boss and shareholders love it. But where is the fun in that? For me it kills any interest in doing what I'm doing. I'm lucky enough to work in a place where using LLMs is not mandatory (yet); I don't know how people can make it through the day just writing prompts and reviewing AI slop.
It's always fun to ask Opus what it thinks about articles like this. Here's what I got with no history or system prompt:
https://burkeholland.github.io/posts/opus-4-5-change-everyth...
Read this article and ultrathink critically about it. Provide your perspective.
The article makes a strong experiential case for improved velocity in AI-assisted development but contains several argumentation weaknesses and conflations worth examining.
The projects described are legitimately non-trivial: Firebase backend integration, Facebook OAuth, iOS apps in Swift (a language the author doesn't know), GitHub Actions pipelines, scheduled cloud functions. Getting these working in hours rather than weeks represents a real capability shift. The author is honest about his limitations and uncertainties, particularly the security concerns.
Where the argument breaks down:
1. "Replace developers" vs "dramatically augment developers" The author's own workflow contradicts the headline claim. He's still:
Making architectural decisions (choosing Firebase) Handling errors Opus couldn't see (XAML via Visual Studio) Writing custom prompts to shape output quality Manually auditing security Making product and UX decisions
This is developer work. The tool changed; the role didn't disappear.
2. The 80% security confidence undermines his thesis

He admits he's shipping apps with "80% confidence" in security and calls it "too damn low." This is the crux: the AI accelerated production but didn't replace the judgment required to responsibly ship production software. The velocity gain exposed a competence gap rather than closing it.
3. Sample bias in project selection
All examples are:
- Greenfield (no existing codebase)
- Single developer
- Personal/family use
- Standard patterns with excellent documentation (Firebase, SwiftUI, React Native)
- No regulatory, compliance, or scale requirements
- No team collaboration or handoff considerations
These constraints define a specific problem space where AI excels. Extrapolating to "developers are replaceable" ignores the majority of professional software work.
4. "Code doesn't need human readability" is underbaked His argument is circular: "Why optimize for human readability when the AI is doing all the work?" But:
His 80% security confidence exists because he can't read the code He had to use external tools (VS) when Opus couldn't diagnose errors What happens when context windows are exceeded and the LLM loses track? Model behavior changes between versions; human-readable code is version-agnostic
The custom prompt he shares actually encodes many good engineering practices (minimal coupling, explicit state, linear control flow) that benefit LLMs and humans. The "no comments needed" claim conflates what's optimal for LLM regeneration with what's optimal for debugging production issues at 3am. What's actually being demonstrated
The honest version of this article would be: Opus 4.5 dramatically compresses the gap between "can write code" and "can ship a personal app" for a specific class of greenfield projects. That's genuinely transformative for hobbyists, indie developers, and people solving their own problems. But that's different from "replacing developers." The article demonstrates a power tool; power tools don't eliminate tradespeople.
People should finally understand that LLMs are a lossy database of PAST knowledge. Yes, if you throw a task at them that has been done tons of times before, it works. Which is not a surprise, because it takes minutes to Google and index multiple full implementations of "Tool that allows you to right-click on an image to convert it". Without an LLM you could do the same: just copy&paste the implementation of that from Microsoft PowerToys, for example.
What LLMs will NOT do, however, is write or invent SOMETHING NEW.
And parts of our industry are still about that: writing software that has NOT been written before.
If you hire junior developers to re-invent wheels: sure, you do not need them anymore.
But sooner or later you will run out of people who know how to invent NEW things.
So: This is one more of those posts that completely miss the point. "Oh wow, if I look up on Wikipedia how to make pancakes I suddenly can make and have pancakes!!!1". That always was possible. Yes, you now can even get an LLM to create you a pancake-machine. Great.
Most of the artists and designers I am friends with have lost their jobs by now. In a couple of years you will notice the LLMs no longer have new styles to copy from.
I am all for the "remix culture". But don't claim to be an original artist, if you are just doing a remix. And LLM source code output are remixes, not original art.
I agree. Claude Code went from being slower than doing it myself to being on average faster, but also far less exhausting, so I can do more things in general while it works.
What's the best coding agent you can run locally? How far behind Opus 4.5 is it?
YEP
Things are changing. Now everyone can build bespoke apps. Are these apps pushing the limits of technology? No! But they work for the very narrow and specific domain they were designed for. And yes, they do not scale and have as many bugs as your personal shell scripts. But they work.
But let's not compare these with something more advanced - at least not yet. Maybe by the end of this year?
We switched from Sonnet 4.5 to Opus 4.5 as our default coding agent recently, and we pay the price for the switch (3x the cost), but as the OP said, it is quite frankly amazing. It does a pretty good job, especially when your code and project are structured in such a way that it helps the agent perform well. Anthropic released an entire video on the subject recently which aligns with my own observations.
Where it fails hard is in the more subtle areas of the code: good design, best practices, good taste, DRY, etc. We often need to prompt it to refactor things, as the quick solution it chose is not in our best interest for the long run. It often ends up in deep investigations of things which are trivially obvious. It is overfitted to use unix tools in their pure form: it fails to remember (even with prompting) that it should run `pnpm test:unit` instead of `npx jest` - it gets it wrong every time.
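(One possible mitigation for the `npx jest` habit, sketched under the assumption that your harness supports pre-tool-use hooks the way Claude Code documents them: the pending shell command arrives as JSON on stdin, and exit code 2 blocks the call while feeding stderr back to the model. The file itself is a made-up example; verify the hook contract for your version before relying on it:)

```typescript
// Hedged sketch of a pre-tool-use hook that rejects raw jest invocations
// and tells the model which repo-specific command to run instead.
import process from "node:process";

let raw = "";
process.stdin.setEncoding("utf8");
process.stdin.on("data", (chunk) => (raw += chunk));
process.stdin.on("end", () => {
  const event = JSON.parse(raw);
  // For Bash tool calls, the shell command lives in tool_input.command.
  const command: string = event?.tool_input?.command ?? "";
  if (/\bnpx\s+jest\b/.test(command)) {
    process.stderr.write("Run `pnpm test:unit` instead of `npx jest` in this repo.\n");
    process.exit(2); // block the call; the stderr message goes to the model
  }
  process.exit(0); // allow everything else
});
```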
But when it works - it is wonderful.
I think we are at the point where we are close to self-improving software and I don't mean this lightly.
It turns out the unix philosophy runs deep. We are right now working on ways to give our agents more shells, and we are frankly only a few iterations away. I am not sure what to expect after this, but I think whatever it is, it will be interesting to see.
I like writing code
It’s a bit strange how anecdotes have become acceptable fuel for 1000-comment technical debates.
I’ve always liked the quote that sufficiently advanced tech looks like magic, but it’s a mistake to assume that things that look like magic also share other properties of magic. They don’t.
Software engineering spans several distinct skills: forming logical plans, encoding them in machine-executable form (coding), making them readable and expandable by other humans (to scale engineering), and constantly navigating tradeoffs like performance, maintainability, and org constraints as requirements evolve.
LLMs are very good at some of these, especially instruction following within well-known methodologies. That’s real progress, and it will be productized sooner rather than later, with concrete use cases, ROI, and a clearly defined end user.
Yet I’d love to see less discussion driven by anecdotes and more discussion about productizing these tools: where they work, usage methodologies, missing tooling, KPIs for specific use cases. And don’t get me started on current evaluation frameworks; they become increasingly irrelevant once models are good enough at instruction following.
I've found asking GPT-5.2 High to review Opus 4.5's code to be really productive. They find different things.
This is great, can't wait for the future when our VC ideas can become unicorns without CEOs & founders...
"Opus 4.5 feels to me like"
The article is a fine opinion piece, but at what point are we going to either:
a) establish benchmarks that make sense and are reliable, or
b) stop with the hypecycle stuff?
Time to get a new job.
Once again. It is not greenfield projects most of us want to use AI coding assistance for. It is for an existing project, with a byzantine mess of a codebase, and even worse messes of infrastructure, business requirements, regulations, processes, and God knows what else. It seems impossible to me that AI would ever be useful in these contexts (which, again, are practically all I ever deal with as a professional in software development).
When complexity increases, you end up handholding them in pieces.
The harness here was Claude Code?
What is with all the Claude spam lately on hn?
I asked Claude’s opinion and it disagreed. :)
Claude’s response:
The article’s central tension is real - Burke went from skeptic to believer by building four increasingly complex apps in rapid succession using Opus 4.5. But his evidence also reveals the limits of that belief.
Notice what he actually built: Windows utilities, a screen recorder, and two Firebase-backed CRUD apps for his wife’s business. These are real applications solving real problems, but they’re also the kinds of projects where you can throw away the code if something goes wrong. When he says “I don’t know how the code works” and “I’m maybe 80% confident these applications are bulletproof,” he’s admitting the core problem with the “AI replaces developers” narrative.
That 80% confidence matters. In your Splink work, you’re the sole frontend developer - you can’t deploy code you’re 80% confident about. You need to understand the implications of your architectural decisions, know where the edge cases are, and maintain the system when requirements change. Burke’s building throwaway prototypes for his wife’s yard sign business. You’re building production software that other people depend on.
His “LLM-first code” philosophy is interesting but backwards. He’s optimizing for AI regeneration rather than human maintenance because he assumes the AI will always be there to fix problems. But AI can’t tell you why a decision was made six months ago when business requirements shift. It can’t explain the constraints that led to a particular architecture. And it definitely can’t navigate political and organizational context when stakeholders disagree about priorities.
The Firebase examples are telling - he keeps emphasizing how well Opus knows the Firebase CLI, as if that proves general capability. But Firebase is extremely well-documented, widely-discussed training data. Try that same experiment with your company’s internal API or a niche library with poor documentation. The model won’t be nearly as capable.
What Burke actually demonstrated is that Opus 4.5 is an excellent pair programmer for prototyping with well-known tools. That’s legitimately valuable. But “pair programmer for prototyping” isn’t the same as “replacing developers.” It’s augmenting someone who already knows how to build software and can evaluate whether the generated code is good.
The most revealing line is at the end: “Just make sure you know where your API keys are.” He’s nervous about security because he doesn’t understand the code. That nervousness is appropriate - it’s the signal that tells you when you’ve crossed from useful tool into dangerous territory.