Hacker News

AI helps ship faster but it produces 1.7× more bugs

187 points by birdculture yesterday at 1:06 PM | 147 comments

Comments

tyleo yesterday at 1:50 PM

I have a theory that vibe coding existed before AI.

I’ve worked with plenty of developers who are happy to slam null checks everywhere to solve NREs, with no thought to why the object is null or whether it should even be null there. There’s just a vibe that the null check works and solves the problem at hand.
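A minimal sketch of the pattern (TypeScript, hypothetical names): the check makes the crash disappear without ever answering why the value was null.

    interface User { name: string }

    // The "vibe" fix: silences the null error without asking why
    // user is null here, or whether it should ever be.
    function renderGreeting(user: User | null): string {
      if (user === null) return "";
      return `Hello, ${user.name}`;
    }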

I actually think a few folks like this can be valuable around the edges of software, but whole systems built like this are a nightmare to work on. IMO, AI vibe coding is an accelerant for this style of not knowing why something works but seeing what you want on the screen.

cmiles8 yesterday at 2:07 PM

There are certainly some valid criticisms of vibe coding. That said, it’s not like the quality of most code was amazing before AI came along. In fact, most code is generally pretty terrible and takes far too long for teams to ship.

Many folks would say that if shipping faster allows for faster iteration on an idea, then the silly errors are worth it. I’ve certainly seen a sharp increase in execs calling BS on dev teams saying they need months to develop some basic thing.

gwbas1c yesterday at 7:32 PM

A few days ago I implemented IComparable in .Net / C#, and copilot "read my mind" and made a suggestion. It saved me 2-3 minutes of typing.

It took me about an hour of debugging to realize that copilot swapped a variable prefixed with "x" and a different one prefixed with "y".

If I wrote it myself, I wouldn't have made that mistake. But, otherwise copilot wrote the exact code that I was going to write.
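Roughly the shape of the bug, as a TypeScript analogue (hypothetical names, not the original C# code):

    interface Point { x: number; y: number }

    // Intended comparator: order by x, then by y.
    function compare(a: Point, b: Point): number {
      if (a.x !== b.x) return a.x - b.x;
      return a.y - b.y;
    }

    // What the completion effectively produced: the x- and y-prefixed
    // variables swapped. It compiles fine; only the ordering is wrong.
    function compareSwapped(a: Point, b: Point): number {
      if (a.y !== b.y) return a.y - b.y;
      return a.x - b.x;
    }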

0x3f yesterday at 1:59 PM

At best this would be 1.7x more _discovered_ bugs. The average PR (IMO) is hardly checked. AI could have 10x as many real issues on PRs, but we're just bad at reviewing PRs.

alexgotoi yesterday at 7:02 PM

The pattern here feels pretty old: every time something shows up that lets people go much faster, we use it to crash harder first. When cars showed up, people didn’t suddenly become more careful because they could now move at 50 km/h instead of 5 – they just plowed into things faster until seatbelts, traffic rules and driver training caught up.

LLMs in coding feel similar. They don’t magically remove the need for tests, specs, and review; they just compress the time between “idea” and “running code” so much that all the missing process shows up as outages instead of slow PRs. The risk isn’t “AI writes code”, it’s orgs refusing to slow down long enough to build the equivalent of traffic lights and driver’s ed around it.

bodge5000 yesterday at 1:59 PM

As has already been said, we've been here before. I could ship significantly faster if I ignored any error handling or edge cases and basically just assumed the data would flow 100% how I expect it to all the time. Of course that is almost never the case, so I'd end up with more bugs.
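For instance, the fast version versus the honest version of the same parse (a TypeScript sketch; the data shape is made up):

    // Ships today: assumes the payload is always exactly as expected.
    function getTotalFast(json: string): number {
      return JSON.parse(json).order.total;
    }

    // Ships later: admits the data might not cooperate.
    function getTotalSafe(json: string): number | undefined {
      try {
        const total = JSON.parse(json)?.order?.total;
        return typeof total === "number" ? total : undefined;
      } catch {
        return undefined; // malformed JSON
      }
    }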

I'd like to say that AI just takes this to an extreme, but I'm not even sure about that. I think it could produce more code and more bugs than I could in the same amount of time, but not significantly more than if I just gave up on caring about anything.

bogzz yesterday at 1:48 PM

oh wow, an LLM-based company with an article that claims AI is oddly not as bad when it comes to generating gobbledegook as everyday empirical evidence should suggest

kkarpkkarp yesterday at 5:01 PM

I can't find whether they deducted false positives before counting the results. I've been using CodeRabbit heavily, and like any other AI code-reviewing tool it produces a lot of them.

For example: it reports missing data validation/sanitization only because the input was already sanitized/validated somewhere that is not visible in the diff.
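A hypothetical TypeScript illustration of that failure mode, where only the second function is in the diff:

    declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };

    // Upstream and unchanged, therefore not in the diff:
    // the input is validated before it ever reaches the query.
    function parseUserId(raw: string): number {
      const id = Number(raw);
      if (!Number.isInteger(id) || id <= 0) throw new Error("invalid id");
      return id;
    }

    // The only function the PR touches. Reviewed in isolation,
    // "id is never validated" looks like a real finding.
    function loadUser(id: number) {
      return db.query("SELECT * FROM users WHERE id = ?", [id]);
    }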

You can tell CodeRabbit it is wrong about this, though, and the tool then accepts it.

neallindsay yesterday at 2:16 PM

1.7x more is not the same as 1.7x as many.
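Concretely, with a hypothetical baseline of 100 bugs:

    1.7x as many: 1.7 × 100       = 170 bugs
    1.7x more:    100 + 1.7 × 100 = 270 bugs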

yomismoaqui yesterday at 2:06 PM

Agentic AI coding is a tool; you can use it wrong.

To give an example of how to use AI successfully, check the following post:

https://friendlybit.com/python/writing-justhtml-with-coding-...

LurkandComment yesterday at 5:17 PM

Sorry guys, I'm having trouble here:

Is it: AI builds the product faster but with more bugs in production (adding overall time to acceptable production)?

or

Is it: AI helps us build faster, even though we have to fix more bugs before production (overall less time to acceptable production, but fixing bugs specifically takes longer)?

yonibot yesterday at 6:36 PM

How useful is this metric if we don't know which LLM produced each MR?

There could be massive differences in quality between LLMs.

nerdjon yesterday at 1:59 PM

Something I have been very curious about for some time now. We know the quality of the code is not very high and that it has a high likelihood of bugs.

But assuming there are no bugs and the code ships: has there been any study of resource usage creeping up and the impact of this on a whole system? In the tests I have done trying to build things with AI, there always seems to be zero attention to efficiency unless you notice it and can point it in the right direction.

I have been curious about the impact this will have on general computing as more low quality code makes it into applications we use every day.

brainless yesterday at 2:21 PM

I use LLMs to generate almost all my code. Currently at 40K lines of Rust, backend and a desktop app. I am a senior engineer with almost all my tech career (16 years) in startups.

Coding with agents has forced me to generate more tests than we do in most startups, think through more things than we get the time to do in most startups, create more granular tasks and maintain CI/CD (my pipelines are failing and I need to fix them urgently).

These are all good things.

I have started thinking through my patterns for generating unit tests. I was mostly generating integration or end-to-end tests before. I started using helper functions in API handlers and writing unit tests for the helpers, bypassing the API-level arguments (so no API mocking or framework tests to deal with). I started breaking tasks down into smaller units, so I can pass them on to a cheaper model.
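A rough sketch of that helper pattern (in TypeScript rather than Rust, hypothetical names): the handler stays thin and the logic gets a plain unit test with no HTTP mocking.

    import { test } from "node:test";
    import assert from "node:assert/strict";

    // Pure helper: holds the logic the API handler would otherwise inline.
    export function applyDiscount(total: number, percent: number): number {
      if (percent < 0 || percent > 100) throw new RangeError("bad percent");
      return total * (1 - percent / 100);
    }

    // The unit test hits the helper directly: no API arguments,
    // no mocking, no framework test harness.
    test("applyDiscount", () => {
      assert.equal(applyDiscount(200, 25), 150);
      assert.throws(() => applyDiscount(100, 150), RangeError);
    });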

There are a few patterns in my prompts, but nothing that feels out of place. I do not use agent files or MCPs. All sources here: https://github.com/brainless/nocodo (the product is itself going through a pivot, so there is that).

sailfast yesterday at 3:34 PM

How many more bugs does it produce if we use CodeRabbit to review PRs? I assume the number will be lower? (Asking seriously and hopefully whether the product would help or would’ve caught the bugs, while also pointing out that the natural conclusion of the article is to purchase your service :) )

strangescript yesterday at 1:53 PM

Do they consider code readability, formatting, and variable naming as "errors" for the overall count? That seems dubious given where we are headed.

No one cares what a compiler or js minifier names its variables in its output.

Yes, if you don't believe we will ever get there, then this is a totally valid complaint. You are also wrong about the future.

phartenfeller yesterday at 1:52 PM

Definitely. But AI can also generate unit tests.

You have to be careful to tell the LLM exactly what to test for, and to manually check the whole suite of tests. But overall it makes me feel way more confident about an increasing amount of generated code. This of course decreases the productivity gains, but it is necessary in my opinion.
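For example, the kind of thing you only catch by reading the suite yourself (a TypeScript sketch; the function is made up): a generated test that passes but pins nothing down.

    import { test } from "node:test";
    import assert from "node:assert/strict";

    const formatPrice = (cents: number): string => `$${(cents / 100).toFixed(2)}`;

    // Vacuous generated test: always passes, asserts nothing useful.
    test("formatPrice (vacuous)", () => {
      assert.equal(formatPrice(1000), formatPrice(1000));
    });

    // What you actually wanted the LLM to check.
    test("formatPrice (real)", () => {
      assert.equal(formatPrice(1999), "$19.99");
    });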

And linters help.

windex yesterday at 2:04 PM

I think devs have now split into two camps, the kvetchers and the shippers. It's a new tool, it's fresh. Things will work themselves out over the next couple of years/months(?). The kvetching helps keep AI research focused on the problem, which is good. Meanwhile, continue to ship.

lherron yesterday at 2:11 PM

They buried the lede. The last half of the article, with ways to ground your dev environment to reduce the most common issues, should be its own article. (However, implementing the proper techniques somewhat obviates the need for CodeRabbit, so I guess it’s understandable.)

exitb yesterday at 2:09 PM

1.7x does not look that bad? If "AI code" is a broad classification that includes people using bad tools, or not being very skilful operators of said tools, then we can expect this number to meaningfully improve over time.

cgearhart yesterday at 1:54 PM

So…great for prototyping (where velocity rules) but somewhere between mixed and negative for critical projects. Seems like this just puts some mildly quantitative numbers behind the consensus & trends I see emerging.

nphardon yesterday at 5:49 PM

Ship fast, let the customer do the QA, is that the idea?

everdrive yesterday at 1:56 PM

Sounds like what companies have been scrambling for this whole time. People just want to dump something out there. They don't really care if it works very well.

stevenfoster yesterday at 6:42 PM

Someone solved this 13 years ago: https://github.com/mattdiamond/fuckitjs

kristopherleads yesterday at 3:30 PM

I really think the answer here is human-in-the-loop. Too many people are thinking that AI is a full-on drop-in replacement for engineers or managers, but ultimately having it be an augment is the magic. I work at FlowFuse so I'm super biased, but that's something I've really enjoyed with our MCP and Expert Assistant - it's built to help you, not to replace you, so you can ask questions, get insights, etc. faster.

I suppose the tl;dr is if you're generating bugs in your flow and they make it to prod, it's not a tool problem - it's a cultural one.

visarga yesterday at 4:57 PM

AI helps ship faster, but we need to code 1.7x more tests (with AI), and that's alright.

827a yesterday at 3:22 PM

Archetypes of prompts that I find AI to be quite good at handling:

1. "Write a couple lines or a function that is pretty much what four years ago I would have gone to npm to solve" (e.g. "find the md5 hash of this blob")

2. "Write a function that is highly represented and sampleable in the rest of the project" (e.g. "write a function to query all posts in the database by author_id" (which might include app-specific steps like typing it into a data model)).

3. "Make this isolated needle-in-a-haystack change" (e.g. "change the text of such-and-such tooltip to XYZ") (e.g. "there's a bug with uploading files where we aren't writing the size of the file to the database, fix that")

I've found that it can definitely do wider-ranging tasks than that (e.g. implement all API routes for this new data type per this description of the resource type and desired routes); and it can absolutely work. But here are the problems I run into:

1. Because I don't necessarily have a grokable handle on what it generated, I don't have a sense of what it's missing or what follow-on prompts are needed. E.g.: I tell it to write an endpoint that allows users to upload files. A few days later, we realize we aren't MD5-hashing the files that got uploaded; there was a field in the database & resource type to store this value, but it didn't pick up on that, and I didn't prompt it to do this, so it's not unreasonable. But oftentimes when I'm writing routes by hand, I'm spending so much time in that function body that follow-on requirements naturally occur to me ("Oh that's right, we talked about needing this route available to both of these two permissions, crap, let me implement that"). With AI, it finishes so fast that my brain doesn't have time to remember all the requirements.

2. We've tried to mitigate this by pushing more development into the specs and requirements up-front. This is really hard to get humans to do, first of all. But more critically: none of our data supports the hypothesis that this has shortened cycle times. It mostly just trades writing TypeScript for reading & writing English (which few engineers I've ever worked with are actually all that good at). The engineers still end up needing long cycle times back-and-forth with the AI to get correct results, and long cycle times in review.

3. The more code you ask it to generate, the more vibeslop you get. Deeply-nested try/catch statements with multiple levels of error handling & throwing. Comments everywhere. Reimplementing the same helper functions five times. These things, we have found, raise the cost and lower the reliability & performance of future prompting, and quickly morph parts of the system into a no-man's-land (literally) where only AIs can really make any change; and every change, even by the AIs, gets harder and harder to ship. Our reported customer issues on these parts of the app are significantly higher than on others, and our ability to triage these issues is also impacted because we no longer have SMEs who can just brain-triage issues in our CS channels; everything now requires a full engineering cycle, with AI involvement, to solve.

Our engineers run the spectrum from "never wanted to touch AI, never did" to "earnestly trying to make it work". Ultimately I think the consensus position is: it's a tool that is nice to have in the toolbox, but any assertion that it's going to fundamentally change the profile of work our engineers do, or even seriously impact hiring over the long term, is outside the realm of foreseeable possibility. The models and surrounding tooling are not improving fast enough.

jasonlotito yesterday at 5:53 PM

The title (Our new report: AI code creates 1.7x more problems) is wrong.

The article says "problems", not "bugs". "Problems" seems to include formatting or naming issues. I couldn't find 1.7x bugs specifically; I only see 3 mentions of bugs, with no number attached.

andrenotgiant yesterday at 7:06 PM

... _says company selling a service to use AI to find bugs._

carra yesterday at 3:47 PM

Am I the only one thinking that 1.7x is a very weird way of saying "70% more"? It's even wrong since, as other comments point out, 1.7x MORE would in fact be 2.7 times as much. Which is not what the bug numbers say.

bilater yesterday at 6:04 PM

for now

SideburnsOfDoom yesterday at 2:05 PM

> ship faster but it produces more bugs

This is ... not actually faster.

mmastrac yesterday at 2:05 PM

In the pre-AI days I worked on a system like this: constructed by a high-profile consulting team, it continuously lost data and failed to meet even basic standards.

I think I've seen so much rush-shipped slop (before and after) that I'm really anxiously waiting for this bubble to pop.

I have yet to be convinced that AI tooling can provide more than 20% or so speedup for an expert developer working in a modern stack/language.

geldedus yesterday at 3:18 PM

Not for me.

naasking yesterday at 3:16 PM

It's totally plausible that AI codegen produces more bugs. It still seems important to familiarize yourself with these tools now though, because that bug count is only ever going to go down. These tools are here to stay.

TheAnkurTyagi yesterday at 5:02 PM

Code reviews take way longer now because you gotta actually read through everything instead of trusting that the dev knew what they were writing. It's like the AI is great at the happy path but completely misses edge cases or makes weird assumptions about state...

The real kicker is when someone copies AI-generated code without understanding it and then 3 months later nobody can figure out why production keeps having these random issues. Debugging AI slop is its own special hell.

bgwalter yesterday at 2:15 PM

The report is from cortex.io, based on only 50 self-selected responses from "engineering leaders" as well as from idpcon.com, hosted by cortex.

All websites involved are vibe-coded garbage that use 100% CPU in Firefox.