Nothing signals an invalid benchmark like a zero false positive rate. Seemingly it is pre-2020 text vs. a few models' reworkings of texts. I can see this model falling apart in many real-world scenarios. Yes, LLMs use strange language if left to their own devices, and that can surely be detected. But a 0% false positive rate under all circumstances? Implausible.
> Nothing points out that the benchmark is invalid like a zero false positive rate
You’re punishing them for claiming to do a good job. If they truly are doing a bad job, surely there is a better criticism you could provide.
Our benchmarks on public datasets put our FPR at roughly 1 in 10,000. https://www.pangram.com/blog/all-about-false-positives-in-ai...
Find me a clean public dataset with no AI involvement and I will be happy to report Pangram's false positive rate on it.
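One way to make the statistical point in this thread concrete: observing zero false positives on a finite benchmark never establishes a true 0% rate; it only puts an upper bound on it. A minimal sketch of that bound, using the exact binomial (Clopper-Pearson) calculation for the zero-error case, which the classic "rule of three" (~3/n) approximates. The dataset size of 10,000 here is illustrative, chosen to match the 1-in-10,000 figure above:

```python
def fpr_upper_bound(n_clean: int, confidence: float = 0.95) -> float:
    """Upper bound on the true false positive rate when ZERO false
    positives are observed on n_clean known-human documents.

    Exact form for the zero-error case: 1 - (1 - confidence)**(1/n),
    which the 'rule of three' approximates as ~3/n for 95% confidence.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_clean)

# Zero observed false positives on 10,000 clean texts still leaves a
# 95% upper bound of about 3 in 10,000 on the true FPR.
print(fpr_upper_bound(10_000))  # ~0.0003
```

So even a perfect score on a 10,000-document benchmark is compatible with a true FPR of a few in 10,000, which is why a reported 0% invites scrutiny of the benchmark rather than settling the question.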