I'm skeptical: use two different AIs that don't share the same weaknesses + a random sample of manual reviews + blacklisting users who submit adversarial inputs for X years as a deterrent.
But how do you know an input is adversarial? There are other issues too: verdicts are arbitrary; the false positive rate means you'd need manual review of all the rejects (unless you're willing to reject something like 5% of genuine research); and an appeals process has to exist but can't be automated, so bad actors can still flood your bureaucracy even if you do implement an automated review process…
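To make the false-positive point concrete, here's a back-of-envelope sketch. Every number except the ~5% false positive rate from the text is an assumption picked for illustration (submission volume, share of adversarial inputs, detection rate), but the base-rate effect holds for any realistic mix where genuine submissions vastly outnumber adversarial ones:

```python
# Base-rate arithmetic for an automated screening filter.
# Only the 5% false positive rate comes from the discussion above;
# the other numbers are hypothetical.
genuine = 10_000        # genuine submissions per year (assumed)
adversarial = 100       # adversarial submissions per year (assumed)
fpr = 0.05              # filter wrongly rejects ~5% of genuine research
tpr = 0.90              # filter catches 90% of adversarial inputs (assumed)

false_rejects = genuine * fpr       # 500 genuine papers rejected
true_rejects = adversarial * tpr    # 90 adversarial inputs caught

share_genuine = false_rejects / (false_rejects + true_rejects)
print(f"{share_genuine:.0%} of all rejects are genuine research")  # 85%
```

So even with a fairly accurate filter, the reject pile is dominated by genuine work, which is why every reject ends up needing manual review anyway.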