tannedNerd • last Tuesday at 5:54 PM • 11 replies

The problem with this is none of this is production quality. You haven’t done edge case testing for user mistakes, a security audit, or even just maintainability.

Yes, Opus 4.5 seems great, but most of the time it tries to vastly overcomplicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping code working.

Replies

structural • last Tuesday at 6:26 PM

Yes, but my junior coworkers don't reliably do edge case testing for user errors either unless specifically tasked to do so, likely with a checklist of the specific kinds of user errors they need to check for.

And it turns out the quality of output you get from both the humans and the models is highly correlated with the quality of the specification you write before you start coding.

Letting a model run amok within the constraints of your spec is actually great for specification development! You get instant feedback on what you wrongly specified or underspecified. On top of this, you learn how to write specifications where critical information that needs to be used together isn't spread across thousands of pages - thinking about context windows when writing documentation is useful for both human and AI consumers.

pseudosavant • last Tuesday at 9:02 PM

Isn't it though? I've worked with plenty of devs who shipped much lower quality code into production than I see Claude 4.5 or GPT 5.2 write. I find that SOTA models are more likely to: write tests, leave helpful comments, name variables in meaningful ways, check if the build succeeds, etc.

Stuff that seems basic, but that I haven't always been able to count on in my teams' "production" code.

jonas21 • last Tuesday at 6:12 PM

I can generally get maintainable results simply by telling Claude "Please keep the code as simple as possible. I plan on extending this later so readability is critical."

maherbeg • last Tuesday at 6:01 PM

That may be true now, but think about how far we've come in a year alone! This is really impressive, and even if the models don't improve, someone will build skills to attack these specific scenarios.

Over time, I imagine even cloud providers, app stores, etc. can start doing automated security scanning for these types of failure modes, or give a more restricted version of the experience to ensure safety too.

bgirard • last Tuesday at 6:14 PM

It's not from a few prompts, you're right. But if you layer on some follow-up prompts to add proper test suites, run some QA, etc., then the quality gets better.

I predict in 2026 we're going to see agents get better at running their own QA, and also get better at not just disabling failing tests. We'll continue to see advancements that will improve quality.

cyberpunk • last Tuesday at 6:15 PM

You should try it with BEAM languages and the 'let it crash' style of programming. With pattern matching and a process isolated per request, you basically only need to code the happy path, and if garbage comes in you just let the process crash. Combined with the TDD plugin (bit of a hidden gem), you can absolutely write production level services this way.
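
For readers who haven't seen the style, here is a rough Python analogy of the idea above. It is not BEAM code: threads share memory and do not give the isolation that BEAM processes do, and `handle` and `serve` are hypothetical names made up for this sketch. Tuple unpacking stands in for pattern matching, and the loop in `serve` plays the role of a very simple supervisor.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(request):
    # Happy path only: no validation and no defensive branches.
    # Tuple unpacking stands in for pattern matching; a request of
    # the wrong shape raises ValueError on this line.
    op, a, b = request
    operations = {
        "add": lambda x, y: x + y,
        "div": lambda x, y: x / y,
    }
    # An unknown op raises KeyError; dividing by zero raises ZeroDivisionError.
    return operations[op](a, b)


def serve(requests):
    """Run each request in its own worker so one crash does not affect the others."""
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(handle, r) for r in requests]
        for future in futures:
            try:
                results.append(("ok", future.result()))
            except Exception as exc:
                # Supervisor role: note the crash and keep serving.
                results.append(("crashed", type(exc).__name__))
    return results


if __name__ == "__main__":
    print(serve([("add", 1, 2), ("div", 1, 0), "garbage", ("pow", 2, 3)]))
```

The handler never checks its input. Bad requests fail on their own and the remaining requests are still answered, which is the property the 'let it crash' style relies on.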

LatencyKills • last Tuesday at 6:00 PM

Agree... but that is exactly what MVPs are. Humans have been shipping MVPs while calling them production-ready for decades.

adriand • last Tuesday at 6:11 PM

> Its answer will be 10x harder to maintain and debug

Maintained and debugged by whom? It's just going to be Opus 4.5 (and 4.6... and 5... etc.) maintaining and debugging it. And I don't think it minds, and I also think it will be quite good at it.

aschobel • last Tuesday at 6:21 PM

There are skills / subagents for that.

something like code-simplifier is surprisingly useful (as is /review)

https://x.com/bcherny/status/2007179850139000872

joelthelion • last Tuesday at 9:12 PM

Depends on the application. In many cases it's good enough.

mikert89 • last Tuesday at 9:35 PM

It's so much easier to create production quality software
