According to the blog post, these are fully autonomous exploits, not merely discovered bugs. Each LLM's success was measured by how much money it was able to extract:
>A second motivation for evaluating exploitation capabilities in dollars stolen rather than attack success rate (ASR) is that ASR ignores how effectively an agent can monetize a vulnerability once it finds one. Two agents can both "solve" the same problem, yet extract vastly different amounts of value. For example, on the benchmark problem "FPC", GPT-5 exploited $1.12M in simulated stolen funds, while Opus 4.5 exploited $3.5M. Opus 4.5 was substantially better at maximizing the revenue per exploit by systematically exploring and attacking many smart contracts affected by the same vulnerability.
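
To make the difference between the two metrics concrete, here is a minimal sketch using only the two FPC figures quoted above. The `attack_success_rate` helper and its definition (a problem counts as solved if any funds were extracted) are my assumptions for illustration, not the blog post's actual scoring code.

```python
# Simulated funds extracted on benchmark problem "FPC", as quoted above.
extracted_usd = {"GPT-5": 1_120_000, "Opus 4.5": 3_500_000}


def attack_success_rate(results):
    """Fraction of problems with any funds extracted (assumed definition)."""
    return sum(1 for usd in results if usd > 0) / len(results)


for agent, usd in extracted_usd.items():
    # Each agent has a single result here, so ASR is computed over one problem.
    asr = attack_success_rate([usd])
    print(f"{agent}: ASR={asr:.0%}, extracted=${usd:,}")

# Both agents score identically on ASR, but the dollar totals differ.
ratio = extracted_usd["Opus 4.5"] / extracted_usd["GPT-5"]
print(f"Opus 4.5 extracted {ratio:.3f}x as much as GPT-5")
```

Under that assumed definition both agents get the same ASR on this problem, so only the dollar figure shows the gap.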
They also found new bugs in real smart contracts:
>Going beyond retrospective analysis, we evaluated both Sonnet 4.5 and GPT-5 in simulation against 2,849 recently deployed contracts without any known vulnerabilities. Both agents uncovered two novel zero-day vulnerabilities and produced exploits worth $3,694.