> although later investigation suggests there may have been data leakage
I think this point is often forgotten. Everyone should assume data leakage until it is strongly evidenced otherwise. It is not on the reader or skeptic to prove that there is leakage; the burden of proof is on the authors. Data leakage is easy to introduce even on small datasets, ones where you can look at everything, and you often do it unknowingly. Subtle things easily spoil data.
Now we're talking about gigantic datasets where there's no chance anyone can manually look through them. We know the filtering methods are imperfect, so how do we come to believe there is no leakage? You can say you filtered the data, but you cannot say there's no leakage.
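To make that concrete, here is a toy sketch of the kind of n-gram overlap filter that is commonly described for decontamination. This is not anyone's actual pipeline; the 13-gram default, the whitespace tokenization, and the example strings are just illustrative choices. The point is that anything that only matches after paraphrasing or reformatting slips straight through, and short eval items can dodge the filter entirely:

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lowercased word n-grams; crude whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, eval_item: str, n: int = 13) -> bool:
    """Flag a training document if it shares any word n-gram with an eval item."""
    return bool(ngrams(train_doc, n) & ngrams(eval_item, n))

eval_item = "What is the capital of France? Answer: Paris."
verbatim  = "trivia dump: What is the capital of France? Answer: Paris."
rephrased = "Q: which city is France's capital? A: It's Paris."

# A near-verbatim copy is caught once n is small enough...
print(is_contaminated(verbatim, eval_item, n=8))    # True
# ...but the eval item is only 8 tokens long, so the default 13-gram
# filter sees nothing to match at all:
print(is_contaminated(verbatim, eval_item))         # False
# And a simple rephrasing sails through at any n:
print(is_contaminated(rephrased, eval_item, n=3))   # False
```

Real filters are fancier than this, but they are still pattern matchers over text they can't fully inspect, which is why "we filtered it" is not the same claim as "there is no leakage."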
Beyond that, we are constantly finding spoilage in the datasets we do have access to, so there is frequent evidence that it is happening.
So why do we continue to assume there's no spoilage? Hype? Honestly, it just sounds like a lie we tell ourselves because we want to believe. But we can't fix these problems if we lie to ourselves about them.
Where the burden of proof supposedly lies is not the definitive guide to what you ought to believe that people online seem to think it is.
Every system has problems. The better question is: what is the acceptable threshold?
For example, Medicare and Medicaid had an improper payment rate of 7.66% [1]. Yes, that adds up to billions of dollars, and there is room for improvement, but that doesn't mean the entire system is failing: roughly 92% of payments are going out as intended.
The same could be said of these models. If the spoilage rate is 10%, does that mean the whole system is bad? Or is it within a tolerable threshold?
[1]: https://www.cms.gov/newsroom/fact-sheets/fiscal-year-2024-im...