This doesn't say anything about how many false positives they actually have. Yes, you can write other programs (that might even invoke another LLM!) to "check" the findings. That's an obvious and reasonable thing to do. But all "vulnerability scanners", AI or not, must take steps to avoid false positives -- that doesn't tell us how well they actually work.
The glaring omission here is a discussion of how many bugs the XBOW team had to manually review in order to make ~1k "valid" submissions. They state:
> It was a unique privilege to wake up each morning and review creative new exploits.
How much of each morning was spent reviewing exploits? And what percentage of them turned out to be real bugs? These are the critical questions that (a) are unanswered by this post, and (b) will determine the success of any product in this space, imo.