I have worked with Google teams as well, and they taught me a fair bit about how to be rigorously skeptical. Challenging their assertions takes domain knowledge, statistical knowledge, data, time, and computational resources. I've done it, but it took real resources.
That said, it's a useful exercise to figure out the plan of attack. In my experience, the "juice" was mainly in "easy true negative" subclasses. They weren't oversampled, but the human brain wouldn't even consider most of that data. Once you ablate those subclasses from the dataset (which takes a lot of additional labelling effort), you can start challenging their assertions. But it's hard.
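A minimal sketch of that kind of check, assuming every example already carries a subclass tag (getting those tags is the expensive labelling part). The `Example` record, the subclass names, and the toy data are all made up for illustration; they are not from any real dataset or from anyone's actual evaluation code.

```python
from collections import namedtuple

# Hypothetical record: true label (0/1), model prediction (0/1), subclass tag.
Example = namedtuple("Example", ["label", "pred", "subclass"])


def accuracy(examples):
    """Fraction of examples where the prediction matches the label."""
    if not examples:
        raise ValueError("no examples to score")
    return sum(e.label == e.pred for e in examples) / len(examples)


def specificity(examples):
    """True negative rate: correct predictions among true negatives."""
    negatives = [e for e in examples if e.label == 0]
    if not negatives:
        raise ValueError("no negative examples to score")
    return sum(e.pred == 0 for e in negatives) / len(negatives)


def ablate(examples, easy_subclasses):
    """Drop every example whose subclass is in the 'easy' set."""
    return [e for e in examples if e.subclass not in easy_subclasses]


def compare(examples, easy_subclasses):
    """Score the full set and the ablated set side by side."""
    kept = ablate(examples, easy_subclasses)
    return {
        "full": {"accuracy": accuracy(examples),
                 "specificity": specificity(examples)},
        "ablated": {"accuracy": accuracy(kept),
                    "specificity": specificity(kept)},
    }


# Toy data (invented): six trivially easy negatives, two hard negatives,
# two positives. The model gets all the easy ones right.
toy = (
    [Example(0, 0, "easy_neg")] * 6
    + [Example(0, 0, "hard_neg"), Example(0, 1, "hard_neg")]
    + [Example(1, 1, "pos"), Example(1, 0, "pos")]
)

result = compare(toy, {"easy_neg"})
for name, scores in result.items():
    print(name, scores)
```

If the headline number drops sharply once the easy subclasses are gone, that gap is the part of the score the easy negatives were carrying. On real data you would also want confidence intervals, since the ablated set is much smaller.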
All the same, I also review a number of articles in that domain, and I haven't seen a group with stronger datasets overall.