No. To give it a fair test, we didn't tinker with model-specific context-engineering. Adding skills, examples, etc is very likely to improve performance. So is any interactive feedback.
Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...
Why, though? That would make sense if you were just trying to do a comparative analysis of different agent's ability to use specific tools without context, but if your thesis is:
> However, [the approach of using AI agents for malware detection] is not ready for production.
Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."