Are we already at the point, or close to it, where well-trained LLMs are more efficient in finding security holes than all but the best developers out there, even for OS kernel code? Can someone educate me on this?
My theory is that a lot of security bugs are low-hanging fruit for LLMs, in the sense that finding them is somewhat tedious but not especially hard pattern matching. (Let's see: the free occurs in foo(), so if I trigger bar() after foo() then I have a use-after-free; that should be possible if I trigger an exception in baz::init().)
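To make that pattern concrete, here is a minimal C sketch of it. The names (session, foo, bar, baz_init) are hypothetical and just mirror the ones above, and since C has no exceptions the "exception in baz::init()" is modeled as an error return. It shows the safe version, with a comment marking the single line whose absence would create the use-after-free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* All names are hypothetical; they mirror foo(), bar() and baz::init() above. */
struct session {
    char *buf; /* owned heap buffer, NULL once released */
};

/* Stand-in for baz::init(): C has no exceptions, so failure is a return code. */
static int baz_init(int should_fail)
{
    return should_fail ? -1 : 0;
}

/* foo(): allocates the buffer and frees it again on the error path. */
int foo(struct session *s, int init_should_fail)
{
    assert(s != NULL);
    s->buf = malloc(16);
    if (s->buf == NULL)
        return -1;
    strcpy(s->buf, "ready");
    if (baz_init(init_should_fail) != 0) {
        free(s->buf);
        /* Drop this line and s->buf dangles: a later bar() reads freed memory. */
        s->buf = NULL;
        return -1;
    }
    return 0;
}

/* bar(): uses the buffer. The NULL check only helps because foo() clears the pointer. */
int bar(const struct session *s)
{
    assert(s != NULL);
    if (s->buf == NULL)
        return -1;
    return (int)strlen(s->buf);
}
```

The "pattern matching" is noticing that the free and the use live in different functions, and that only the error path connects them.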
Efficiency in finding isn't really the metric to consider. I'm sure a good security person could look at these and find the bugs, but nobody did.
IMHO, if you were to do a manual audit of the Linux kernel, the first thing to do is exclude all the stuff you're never going to run, because why spend time on it?
These scans are looking at everything, because once you set it up, the incremental cost to look at everything is not so bad.
This is going to push lesser-used stuff out of the mainline, which sucks for the people who were using it, but is better for everyone else.
My experience with these tools is that they generate absolutely enormous amounts of insidiously wrong false positives, and it actually takes a decent amount of skill to work through the 99% which is garbage with any velocity.
Of course some people don't do that, and send all the reports anyway... and then scream from the hilltops about how incredible LLMs are when by sheer luck one happens to be right. Not only is that blatant p-hacking, it's incredibly antisocial.
It's disingenuous marketing speak to say LLMs are "finding" any security holes at all: they find a thousand hypotheticals, of which one or two might be real. A broken clock is right twice a day.
"Even for OS kernel code" is doing a lot of work. What you really mean is "legacy C code," and yes, since about 6 months ago these systems have gotten reliable enough that they are basically superhuman at identifying buffer overflows and similar bugs. A remarkable number of these bugs are fixed by adding an (if (length > MAX_BUFFER) {return -1;}) check, just the classic C footguns. Even as a huge LLM skeptic I am not too surprised that these systems might be superhuman at finding tedious, tricky stuff like this.
At the same time, a lot of these bugs were in places that people weren't looking because the code isn't actually important. This kernel code had already been a longstanding problem in terms of low-effort bot-driven security reports, and nobody had any interest in maintaining it. So this was more LLM-assisted technical management than LLM-assisted security: it finally made the situation uncomfortable enough for the team to do something about it.
Another example: Mythos found a real bug in FreeBSD that occurs when running as an NFS server with a public connection. But... who on earth is doing that? I would guess 99.9% of FreeBSD NFS installations are on home LANs. More importantly, Anthropic spent $20,000 to find this bug. Think of it as paying a full-time FreeBSD dev for a month, and that's what they find: I'd say "ok, looks like FreeBSD has a pretty secure codebase, let's fix that stupid bug, stop wasting our money, and get you on a more exciting project."
I do think anyone who has a legacy open-source C/C++ codebase owes it to their users to run it by Claude/Codex: check the pointers and arrays, make sure everything looks ok. I just wish people were able to discuss it in proper context, alongside other native debugging tools!
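For anyone who hasn't been bitten by it, here is roughly what the (if (length > MAX_BUFFER) {return -1;}) fix mentioned in this comment looks like in context. This is a minimal sketch, not code from any real kernel: struct message, store_payload and the value of MAX_BUFFER are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BUFFER 64 /* hypothetical fixed buffer size */

/* Hypothetical message with a fixed-size buffer. */
struct message {
    char data[MAX_BUFFER];
    size_t used;
};

/* Copies `length` bytes of `payload` into msg; returns 0 on success, -1 on rejection. */
int store_payload(struct message *msg, const char *payload, size_t length)
{
    assert(msg != NULL && payload != NULL);
    /* The one-line fix: without this check, memcpy writes past msg->data. */
    if (length > MAX_BUFFER)
        return -1;
    memcpy(msg->data, payload, length);
    msg->used = length;
    return 0;
}
```

The check is trivial to write. The tedious part is noticing which of a thousand memcpy calls takes a caller-controlled length with no such check in front of it.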
> well-trained LLMs are more efficient in finding security holes than all but the best developers out there, even for OS kernel code?
No.
Like everything else an LLM touches, it is prone to slop and hallucinations.
You still need someone who knows what they are doing to review (and preferably manually validate) the findings.
What all this recent hype carefully glosses over is the volume of false positives. I guarantee you it is > 0 and most likely a fairly large number.
And like most things LLM, the bigger the codebase, the more likely false positives become, due to self-imposed context window constraints.
It's all very well these blog posts saying "LLM found this serious bug in Firefox." Well, yeah, but that's only because the security analyst filtered out all the junk (and knew what to ask the LLM in the prompt in the first place).
We are there. This is pretty much the reason why Mythos isn't being released publicly.
"More efficient" of course has many axes (cost, energy consumption, manual labor requirement vs cost of human, time, quality, etc.). However, as a long-time reverse engineer and exploit developer who has worked in the field professionally, I would say LLMs are now useful; their utility exceeds that which was previously available. That is, LLM-assisted exploit discovery and especially development is faster, more efficient, and ultimately cheaper than non-LLM-assisted processes.
What commenters don't seem to understand is that especially CVE spam / bug bounty type vulnerability research has always been an exercise in sifting through useless findings and hallucinations, and LLMs, used well, are great at reducing this burden.
Previously, a lot of "baseline" / bottom tier research consisted of "run fuzzers or pentest tools against a product; if you're a bottom feeder just stuff these vulns all into the submission box, if you're more legit, tediously try to figure out which ones are reachable." LLMs with a test harness do an _amazing_ job at reducing this tedium; in the memory safety space "read across 50 files to figure out if this UAF might be reachable" or in the web space, "follow this unsanitized string variable to see if it can be accessed by the user" are tasks that LLMs with a harness are awesome at. The current models are also about 50% there at "make a chain for this CVE," depending on the shape of the CVE (they usually get close given a good test harness).
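As a toy illustration of what "figure out if this UAF might be reachable" boils down to, here is a minimal sketch of a reachability check over a hand-written call graph. Everything in it is an assumption made for illustration (the call_graph type, the MAX_FUNCS limit, the idea that the graph is already extracted); the hard part in real research is building an accurate graph across those 50 files, which is the tedium the harness plus LLM takes over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FUNCS 8 /* hypothetical limit, enough for a toy graph */

/* calls[a][b] is true when function a may call function b. */
struct call_graph {
    size_t count;
    bool calls[MAX_FUNCS][MAX_FUNCS];
};

/* Depth-first search from `from`, skipping functions already visited. */
static bool visit(const struct call_graph *g, size_t from, size_t target, bool *seen)
{
    if (from == target)
        return true;
    seen[from] = true;
    for (size_t next = 0; next < g->count; next++) {
        if (g->calls[from][next] && !seen[next] && visit(g, next, target, seen))
            return true;
    }
    return false;
}

/* True when the vulnerable function `target` can be reached from `entry`. */
bool is_reachable(const struct call_graph *g, size_t entry, size_t target)
{
    bool seen[MAX_FUNCS] = { false };

    assert(g != NULL && g->count <= MAX_FUNCS);
    assert(entry < g->count && target < g->count);
    return visit(g, entry, target, seen);
}
```

A finding whose function is not reachable from any attacker-facing entry point is the kind that used to get stuffed into the submission box anyway.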
It seems that the concern with the unreleased models is pretty much that this has advanced once again from where it is today (where you need smart prompting and a good harness) to the LLM giving you exploit chains in exchange for "giv 0day pl0x," and based on my experience, while this has got an element of puffery and classic capitalist goofiness to it ("the model is SO DANGEROUS only our RICHEST CUSTOMERS can have it!"), I believe this is just a small incremental step and entirely believable.
To summarize: "more efficient than all but the best" comes with too many qualifiers, but "are LLMs meaningfully useful in exercising vulnerabilities in OS kernel code," or "is it possible to accelerate vulnerability research and development with LLMs" - 100% absolutely.
And you don't have to believe one random professional (me); this opinion is fairly widespread across the community:
https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...
https://lwn.net/Articles/1065620/
etc.
In terms of quantity, definitely yes (a single person managing a swarm of Opusi can already find many more real bugs than a security researcher, hence the rise in reports).
In terms of quality ("are there bugs that professional humans can't see at any budget but LLMs can?") - it's not very clear, because Opus is still worse than a human specialist, but Mythos might be comparable. We'll just have to wait and see what results Project Glasswing gets.
Either way, cybersecurity is going to get real weird real soon, because even slightly-dumb models can have a large effect if they are cheap and fast enough.
EDIT: Mozilla's answer to the second question is "no," by the way: "Encouragingly, we also haven’t seen any bugs that couldn’t have been found by an elite human researcher." That was said about the 271 vulnerabilities recently found by Mythos. https://blog.mozilla.org/en/firefox/ai-security-zero-day-vul...