I'll share the first-hand account I recently got from someone else.
> We've used it at work
> it is... not as hype as everyone is concerned about
> I'd argue the framework around it for security scanning is the more useful side of the tool, it definitely doesn't take a huge model to get all the issues it flagged on our systems
> For us, it absolutely flooded us with noise
> I mean hundreds if not thousands of false positives or minor issues or not applicable
> For every one reasonable issue
> The biggest issue it created was the execs treated every issue it produced like it was a drop everything and fix the issue type deal
> I'm talking company-wide drop-all-things "we need to patch nginx because this module that no one uses and is disabled by default has this RCE vulnerability"
> Or "all ec2 AMIs need to be upgraded because it flagged a version-specific docker vulnerability", it flagged every single machine with docker regardless of whether the actual vulnerability was relevant
> Vulnerability was with a very specific Auth plugin configuration you could enable with docker and specifically the Mosley docker compatible tool, but it is clear it only knew there was a vulnerability in docker, not if it was applicable or not
> Meanwhile dirtyfrag and friends, not a single peep from it btw, despite it allowing for container escape
> Idk, I was underwhelmed with the quality of the reporting it gave really. If the company allowed me to get information about all the infrastructure in our entire organisation to run Claude over it repeatedly looking for recent CVEs I'm sure I could produce the same results...
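To make the applicability complaint in the quoted account concrete, here is a minimal sketch of the kind of check it says was missing: keep a finding only if the feature that makes it exploitable is actually enabled on that host. Everything here is hypothetical (the `Finding` shape, the `requires` field, the host names, the `authz-plugin` label); it assumes you already have a per-host inventory of enabled features from somewhere, and it is not how any particular scanner works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    host: str
    package: str
    # Hypothetical: features/modules/plugins that must be enabled
    # for the issue to be exploitable on a host.
    requires: frozenset = frozenset()


def applicable(finding, enabled_by_host):
    """True if every feature the finding depends on is enabled on its host."""
    enabled = enabled_by_host.get(finding.host, set())
    return finding.requires <= enabled


def triage(findings, enabled_by_host):
    """Split findings into (relevant, noise) based on host configuration."""
    relevant = [f for f in findings if applicable(f, enabled_by_host)]
    noise = [f for f in findings if not applicable(f, enabled_by_host)]
    return relevant, noise


# Two hosts run docker, only one has the (hypothetical) plugin enabled.
findings = [
    Finding("web-1", "docker", frozenset({"authz-plugin"})),
    Finding("web-2", "docker", frozenset({"authz-plugin"})),
]
enabled_by_host = {"web-1": {"authz-plugin"}, "web-2": set()}
relevant, noise = triage(findings, enabled_by_host)
```

A finding with an empty `requires` set is always kept, so anything the inventory can't speak to still reaches a human.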
In other words it is equivalent to spending a million dollars on an audit by a software security consulting company
I think Opus 4.6 and Mythos overall/marketing-wise are key points because they told the world that LLMs are now a critical/useful tool for security audits.
It aligns with the significant jump in helpfulness in CTF.
But I think it's good to hear that it's not that crazy good. Everything slowing it down is good.
I'm pretty impressed with regular Claude Code with Opus 4.7/4.8 in finding vulnerabilities in our code. Maybe 70% are false positives though. It's a lot of work to manually push back on the findings. Still worth it.
I thought one of the advantages of Glasswing was that it could produce a PoC for you. Was it producing working PoCs?
why are folks looking at the output of the first pass?
my understanding, and experience, is that you 1. run a bunch of sessions with small permutations to create variety, 2. run more sessions to dedupe reports into a smaller collection of potential vulns, 3. run a handful of agents at max effort to write PoCs + write-ups, 4. rank findings, 5. finally look at what, if anything, was found. maybe ask questions, try and understand if the PoC is running against a realistic setup.
until you can confirm a vuln report is valid, you must assume it is invalid.
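The dedupe, rank, and "invalid until confirmed" parts of the steps above can be sketched roughly like this. It's a toy under loud assumptions: the `Report` fields, the 1-10 `severity` score, and deduping on a `(component, weakness)` key are all made up for illustration, and the session-running and PoC-writing steps are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Report:
    component: str
    weakness: str
    severity: int  # hypothetical 1-10 score
    poc_confirmed: bool = False  # invalid until a PoC is confirmed


def dedupe(reports):
    """Merge reports sharing (component, weakness), case-insensitively.

    Keeps the highest severity seen and any PoC confirmation.
    """
    merged = {}
    for r in reports:
        key = (r.component.lower(), r.weakness.lower())
        if key not in merged:
            merged[key] = Report(r.component, r.weakness, r.severity, r.poc_confirmed)
        else:
            m = merged[key]
            m.severity = max(m.severity, r.severity)
            m.poc_confirmed = m.poc_confirmed or r.poc_confirmed
    return list(merged.values())


def rank(reports):
    """Confirmed findings first, then by severity, descending."""
    return sorted(reports, key=lambda r: (r.poc_confirmed, r.severity), reverse=True)


def worth_reading(reports):
    """Only surface findings whose PoC has been confirmed."""
    return [r for r in rank(dedupe(reports)) if r.poc_confirmed]


raw = [
    Report("nginx", "rce", 9),
    Report("NGINX", "RCE", 7, poc_confirmed=True),
    Report("docker", "authz bypass", 8),
]
shortlist = worth_reading(raw)
```

Here the two nginx reports collapse into one confirmed finding, and the unconfirmed docker report never reaches a human.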
Seems like there might be a market for a product that just prefixes "The AI Said" on emails to executives about security vulns.
This reminds me of when I added Snyk to our CI/CD and brought development to a standstill
This is the same gripe I have over any LLM vulnerability tooling. 95% of what gets flagged is something that if taken by itself could be a vulnerability. However, the path to execute that specific vuln, in that specific function, is impossible in that particular code base and it just makes noise.
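A crude version of the "is there even a path to it" filter looks like the sketch below: walk a call graph from the real entry points and drop flagged functions that are never reached. The call graph format (a dict of caller to callees) and all function names are hypothetical, and it assumes some static analysis step has already produced the graph. Real code bases have dynamic dispatch, reflection, and config-driven paths that a plain graph walk will miss, and being reachable still doesn't prove the vulnerable path can be triggered.

```python
from collections import deque


def reachable(call_graph, entry_points):
    """Breadth-first walk of the call graph from the given entry points."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def filter_reachable(flagged, call_graph, entry_points):
    """Keep only flagged functions that some entry point can reach."""
    live = reachable(call_graph, entry_points)
    return [fn for fn in flagged if fn in live]


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "main": ["parse", "render"],
    "parse": ["decode"],
    "legacy_import": ["unsafe_eval"],  # nothing calls legacy_import
}
kept = filter_reachable(["decode", "unsafe_eval"], call_graph, ["main"])
```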
In other words it creates work. In other words Jevons paradox.
I can’t wait for the first court case where an LLM surfaces a vuln, lazy devs ignore it, and someone later sues the company into oblivion for liability.
> The biggest issue it created was the execs treated every issue it produced like it was a drop everything and fix the issue type deal
While this is definitely not the ideal end of the spectrum either, execs treating security issues as something serious instead of annoyances that should only be addressed if revenue can be tied to doing so is a welcome improvement.
It seems like there is a genuine communication breakdown between management and engineering. Engineers know that there are vulnerabilities all over the place, that there have been for ages, and that where the rubber hits the road not every vulnerability represents a successful exploit by some nefarious actor.
Management can often treat cybersecurity like a black box that represents millions upon millions in liability. If Mythos represents an opportunity to bring management's understanding of the amount of "security vulnerability debt" everyone carries into the real world, it might be a good thing.