Have you read Firefox's findings? They found it to be qualitatively improved over Opus, and have published several of the resulting CVEs as well as more detailed numbers.
They also seem to point to it being more the harness than the model itself.