Every system has problems. The better question is: what is the acceptable threshold?
For example, Medicare and Medicaid had a fraud rate of 7.66% [1]. Yes, that is a lot of billions, and there is room for improvement, but it doesn't mean the entire system is failing: over 92% of cases are being covered as intended.
The same could be said of these models. If the spoilage rate is 10%, does that mean the whole system is bad, or is it within a tolerable threshold?
[1]: https://www.cms.gov/newsroom/fact-sheets/fiscal-year-2024-im...
I think it's worth being highly skeptical about fraud rates stated to two decimal places of precision. Fraud is, by design, hard to detect accurately. It would be more accurate to say that Medicare decides 7.66% of its cases are fraudulent according to its own policies and procedures, which are likely conservative and cannot account for undetected fraud. The true rate is likely higher, perhaps much higher.
There's also the problem of false negatives vs. false positives. If your goal is to cover 100% of legitimate claims, you can achieve that trivially by just never denying a claim; of course, that also lets every fraudulent claim through. You have to understand both the false negative rate (the cost of missed fraud) and the false positive rate (the cost of fraud fighting, i.e. legitimate claims flagged or denied), and then balance them.
The same applies to using models in science to make predictions.
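To make that balancing act concrete, here's a toy sketch in Python. Everything in it is made up for illustration (the scores, the class sizes, and the per-claim costs); the point is only that the right threshold comes from weighing the two costs against each other, not from driving either rate to zero.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic fraud scores: fraudulent claims tend to score higher (illustrative only).
    n_legit, n_fraud = 9000, 1000
    scores = np.concatenate([rng.normal(0.3, 0.15, n_legit),
                             rng.normal(0.7, 0.15, n_fraud)])
    is_fraud = np.concatenate([np.zeros(n_legit, bool), np.ones(n_fraud, bool)])

    COST_MISSED_FRAUD = 5000  # assumed average loss per undetected fraudulent claim
    COST_FALSE_ALARM = 200    # assumed cost of investigating/denying a legitimate claim

    best = None
    for t in np.linspace(0.0, 1.0, 101):
        flagged = scores >= t
        fn = int(np.sum(is_fraud & ~flagged))   # fraud that slips through
        fp = int(np.sum(~is_fraud & flagged))   # legitimate claims flagged
        cost = fn * COST_MISSED_FRAUD + fp * COST_FALSE_ALARM
        if best is None or cost < best[0]:
            best = (cost, t, fn, fp)

    cost, t, fn, fp = best
    print(f"threshold={t:.2f} total_cost={cost} missed_fraud={fn} false_alarms={fp}")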
> The better question is: what is the acceptable threshold?
Currently we are unable to answer that question. AND THAT'S THE PROBLEM. I'd be fine if we could. Well, at least far less annoyed. I'm not sure what the threshold should be, but we should always try to minimize it, and at the very least error bounds would do a lot of good toward making that happen. But right now we have no clue, and that's why this question keeps getting brought up. It's not that we leave specific error levels unmentioned because they're small and we don't want you looking at them; we leave them unmentioned because nobody has a fucking clue.
And until someone has a clue, you shouldn't trust that the error rate is low. The burden of proof is on the one making the claim of performance, not on the one asking for evidence of that claim (i.e. the skeptics).
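Even crude error bounds would be an improvement over nothing. A minimal sketch of what that could look like: a Wilson score interval around an error rate estimated from a manual spot check (the 12-out-of-150 numbers below are hypothetical).

    from math import sqrt

    def wilson_interval(errors: int, n: int, z: float = 1.96) -> tuple[float, float]:
        """Approximate 95% Wilson score interval for a binomial proportion."""
        p = errors / n
        denom = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    # Hypothetical spot check: 12 errors found in 150 manually verified outputs.
    lo, hi = wilson_interval(12, 150)
    print(f"point estimate 8.0%, 95% CI roughly {lo:.1%} to {hi:.1%}")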
Btw, I'd be careful with percentages, especially when the absolute numbers are very large. E.g. LLMs are being trained on trillions of tokens; 10% of 1 trillion is 100 billion. The complete works of Shakespeare are about 1.2M tokens... a 10% error rate would be more than enough to spoil any dataset. The bitter truth is that as the absolute number increases, the threshold for acceptable spoilage (as a percentage) needs to decrease.
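Back-of-the-envelope, with rough and purely illustrative token counts:

    TRAINING_TOKENS = 1_000_000_000_000  # ~1 trillion tokens (illustrative)
    SHAKESPEARE_TOKENS = 1_200_000       # ~1.2M tokens, complete works

    spoiled = 0.10 * TRAINING_TOKENS
    print(f"10% spoilage = {spoiled:.3g} tokens "
          f"= roughly {spoiled / SHAKESPEARE_TOKENS:,.0f} complete Shakespeares")
    # -> 10% spoilage = 1e+11 tokens = roughly 83,333 complete Shakespeares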
Data leakage is an eval problem, not an accuracy problem.
That is, the problem is not that the AI is wrong X% of the time. The problem is that, in the presence of a data leak, there is no way of knowing what the value of X even is.
This problem is recursive: in the presence of a data leak, you also cannot know for sure how much data has leaked.
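One way to see why X is unknowable from the benchmark score alone: the measured accuracy mixes performance on leaked and clean items, which gives one equation with two unknowns. All numbers below are hypothetical.

    # measured_acc = leak_frac * acc_on_leaked + (1 - leak_frac) * true_acc
    measured_acc = 0.90
    acc_on_leaked = 0.99  # assume near-perfect recall of memorized items

    for leak_frac in (0.0, 0.1, 0.3, 0.5):
        true_acc = (measured_acc - leak_frac * acc_on_leaked) / (1 - leak_frac)
        print(f"leak={leak_frac:.0%} -> implied true accuracy {true_acc:.1%}")
    # The same 90% measured score is consistent with true accuracy anywhere
    # from 90% down to ~81% across these assumed leak fractions.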
In the protein annotation world, which is largely driven by inferring common ancestry between a protein of unknown function and one of known function, common error thresholds range from an FDR of 10^-3 down to 10^-6. Even a 1% error rate would be considered abysmal. This is in part because it is trivial to get 95% accuracy in prediction; the challenging problem is to get some large fraction of the non-trivial 5% correct.
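To put rough numbers on why even 1% would be abysmal at database scale (the database size is only an order-of-magnitude figure, and this assumes annotations get pushed out to every sequence):

    DATABASE_PROTEINS = 250_000_000  # order of magnitude of a large sequence database

    for fdr in (1e-2, 1e-3, 1e-6):
        expected_false = fdr * DATABASE_PROTEINS
        print(f"FDR {fdr:g}: ~{expected_false:,.0f} false annotations propagated")
    # FDR 0.01:  ~2,500,000 false annotations
    # FDR 0.001: ~250,000
    # FDR 1e-06: ~250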
"Acceptable" thresholds are problem specific. For AI to make a meaningful contribution to protein function prediction, it must do substantially better than current methods, not just better than some arbitrary threshold.