No, what we were seeing with curl was script kiddies. It wasn't about the quality of the models at all. They were not filtering their results for validity.
It was definitely partially about model quality. The frontier models are capable of producing valid findings with (reasonably) complex exploit chains on the first pass (or with limited nudging) and are much less prone to making up the kinds of nonsensical reports that were submitted to curl. Compared to now, the old models essentially didn't work for security.
If those script kiddies had been using today's models instead and _still_ didn't do any filtering, a lot more of those bugs would have been true positives.
It was definitely partially about model quality. The frontier models are capable of producing valid findings with (reasonably) complex exploit chains on the first pass (or with limited nudging) and are much less prone to making up the kinds of nonsensical reports that were submitted to curl. Compared to now, the old models essentially didn't work for security.
If those script kiddies had been using today's models instead and _still_ didn't do any filtering, a lot more of those bugs would have been true positives.