> Anthropic's own scaffold is described in their technical post: launch a container, prompt the model to scan files, let it hypothesize and test, use ASan as a crash oracle, rank files by attack surface, run validation. That is very close to the kind of system we and others in the field have built, and we've demonstrated it with multiple model families, achieving our best results with models that are not Anthropic's. The value lies in the targeting, the iterative deepening, the validation, the triage, the maintainer trust. The public evidence so far does not suggest that these workflows must be coupled to one specific frontier model.
The argument in the article is that the framework to run and analyze the software being tested is doing most of the work in Anthropic's experiment, and that you can get similar results from other models when used in the same way.
Maybe that's true, but they didn't actually show that that's true, since they didn't try scaffolding smaller models in a similar way at all.