Hacker News

jcalvinowens · yesterday at 2:08 PM · 4 replies

My experience with these tools is that they generate absolutely enormous amounts of insidiously wrong false positives, and it takes a decent amount of skill to work through the 99% that is garbage with any velocity.

Of course some people don't do that, and send all the reports anyway... and then shout from the hilltops about how incredible LLMs are when by sheer luck one happens to be right. Not only is that blatant p-hacking, it's incredibly antisocial.

It's disingenuous marketing speak to say LLMs are "finding" any security holes at all: they surface a thousand hypotheticals, of which one or two might be real. A broken clock is right twice a day.


Replies

binaryturtle · yesterday at 2:39 PM

I used GitHub's Copilot once and let it check one of my repositories for security issues. It found countless issues (like 30 or 40 for a single PHP file of some ~400 lines). Some even sounded plausible enough that I had a closer look, just to make sure. In the end none of them were real issues at all. In some cases it invented problems whose "fixes" would have forced me to wrap simple calls into the PHP standard library in wild workaround code. That was the only time I wasted my time with it. :D

NitpickLawyer · yesterday at 2:32 PM

Your experience sounds at least 3-6 months old. Long-time kernel maintainers have recently written on this subject: they say that ~3 months ago the quality and accuracy of the reports crossed a threshold, and that these tools are now legitimately useful.

Legend2440 · yesterday at 2:35 PM

This is incorrect. Here's the curl maintainer talking about dozens of bugs found using LLMs: https://daniel.haxx.se/blog/2025/10/10/a-new-breed-of-analyz...

bri3d · yesterday at 2:52 PM

I strongly disagree with this take. Frankly, it reads like the state of "research" pre-LLMs, when people would run fuzzers and scripted analysis tools (which by their nature DO generate enormous amounts of insidiously wrong false positives), stuff the output into bug bounty boxes, and collect a paycheck whenever one happened to be correct by luck.

Modern LLMs with a reasonable prompt and some form of test harness are, in my experience, excellent at taking a big list of potential vulnerabilities and figuring out which ones might be real. They're also pretty good, depending on the class of vuln and the guardrails in the model, at developing a known-reachable vulnerability into real exploit tooling, which is another big win.

This does require the _slightest_ bit of work (i.e. don't prompt the LLM with "find possible use-after-free issues in this code," or it will give you a lot of slop; prompt it with "determine whether the memory safety issues in this file could present a security risk" and you get somewhere), but not some kind of elaborate setup or prompt hacking, just a little common sense.
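A minimal sketch of the triage loop this comment describes, under stated assumptions: `complete()` is a hypothetical stand-in for whatever LLM client you actually use (stubbed here so the sketch runs on its own), and the prompt wording and filter are illustrative, not a known-good recipe:

```python
# Sketch of LLM-assisted vulnerability triage: ask for a reachability
# judgment on each candidate finding, rather than "find possible bugs".

def complete(prompt: str) -> str:
    # Hypothetical stub standing in for a real LLM client call.
    # A real implementation would send `prompt` to your model of choice.
    return "NOT A SECURITY RISK: the index is bounds-checked before use."

def triage_prompt(filename: str, source: str) -> str:
    # Ask whether the issue is actually exploitable -- per the comment,
    # "find possible use-after-free issues" invites slop, while a
    # reachability question gets you somewhere.
    return (
        f"Determine whether the memory safety issues in {filename} "
        "could present a security risk. For each candidate, explain "
        "whether the dangerous path is actually reachable.\n\n"
        f"```c\n{source}\n```"
    )

def is_worth_reporting(verdict: str) -> bool:
    # Crude filter: drop findings the model itself rules out. A real
    # harness would also try to confirm reachability with a test case.
    return not verdict.upper().startswith("NOT A SECURITY RISK")

verdict = complete(triage_prompt("parser.c", "int buf[8]; buf[i] = 0;"))
print(is_worth_reporting(verdict))  # the stubbed verdict is filtered out
```

The point of the shape, not the specifics: the model is used as a filter over candidate findings, and only findings it can defend as reachable get escalated to a human.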