I wouldn't be surprised if the headline is accurate, but AI detectors are widely understood to be unreliable, and I see no evidence that this AI detector has overcome the well-deserved stigma.
In particular, conference papers are already extremely formulaic: organized the same way and built from the same stock phrasings and terms of art. AI or not, it's hard to tell them apart.
The conference papers were flagged at 1% and the peer reviews at 20%. Is there any explanation for that gap other than more of the reviews being AI-generated than the papers themselves?
We can't use this to convict any single reviewer, but we can say with near certainty that many reviewers simply handed their review work off to an AI.
Co-founder of Pangram here. Our false positive rate is typically around 1 in 10,000. https://www.pangram.com/blog/all-about-false-positives-in-ai....
We also wanted to quantify our EditLens model's FPR on the same domain, so we ran all of ICLR's 2022 reviews. Of 10,202 reviews, Pangram marked 10,190 as fully human, 10 as lightly AI-edited, 1 as moderately AI-edited, 1 as heavily AI-edited, and none as fully AI-generated.
That's roughly a 1-in-1,000 FPR for light AI edits and a 1-in-10,000 FPR for heavy AI edits.
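For concreteness, here's a minimal sketch of how those rates fall out of the counts above. It assumes every flag on a 2022 (pre-ChatGPT) review counts as a false positive, which is the premise of the baseline run:

```python
# FPRs implied by the ICLR 2022 baseline run described above.
# Assumption: 2022 reviews predate ChatGPT, so any AI flag is a false positive.
total = 10_202
light, moderate, heavy = 10, 1, 1  # flagged counts from the run above

fpr_light = light / total          # ~1 in 1,000 for light AI edits
fpr_heavy = heavy / total          # 1 in 10,202 for heavy AI edits
fpr_any = (light + moderate + heavy) / total  # ~1 in 850 for any flag at all

print(f"light: 1 in {round(1 / fpr_light):,}")
print(f"heavy: 1 in {round(1 / fpr_heavy):,}")
print(f"any flag: 1 in {round(1 / fpr_any):,}")
```

Note that pooling all flag levels gives a somewhat higher rate (~1 in 850) than either headline number alone.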