For malware detection, many models are biased toward either flagging a threat or missing one (likely something that can be adjusted with a prompt).
I suggest tasks whose answers cannot be guessed (find the threat, not tell whether one exists). And 2D charts, for both ROC and pricing; see https://quesma.com/benchmarks/binaryaudit/
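
To make the bias point concrete, here is a minimal Python sketch of how a single yes/no detector lands on a ROC-style 2D chart (false-positive rate on x, true-positive rate on y). The function name `roc_point` and the toy verdicts are my own assumptions for illustration, not data or code from the linked benchmark.

```python
def roc_point(labels, verdicts):
    """Return (false_positive_rate, true_positive_rate) for binary verdicts.

    labels[i]   is True when sample i really contains a threat.
    verdicts[i] is True when the model flagged sample i.
    """
    tp = sum(1 for y, v in zip(labels, verdicts) if y and v)
    fp = sum(1 for y, v in zip(labels, verdicts) if not y and v)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    return fp / neg, tp / pos


# Toy data, made up for illustration only.
labels = [True, True, True, True, False, False, False, False]
always_flags = [True] * 8    # hypothetical model biased toward "threat"
never_flags = [False] * 8    # hypothetical model biased toward "clean"
mixed = [True, True, True, False, True, False, False, False]

for name, verdicts in [
    ("always_flags", always_flags),
    ("never_flags", never_flags),
    ("mixed", mixed),
]:
    fpr, tpr = roc_point(labels, verdicts)
    print(f"{name}: FPR={fpr:.2f} TPR={tpr:.2f}")
```

The two biased detectors sit at opposite corners of the chart, (1, 1) and (0, 0), which is why a single accuracy number hides the bias and a 2D plot does not. It is also why a task where the model only has to say yes or no can be partly guessed, while a "find it" task cannot.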