"And for most deep learning papers I read, domain experts have not gone through the results with a fine-tooth comb inspecting the quality of the output. How many other seemingly-impressive papers would not stand up to scrutiny?"
Is this really not the case? I've read some of the AI papers in my field, and I know many other domain experts have as well. That said, I do think CS/software-based work is generally easier to check than biology (or it may just be that I know very little bio).
Reading a paper is not the same as verifying the results, which in turn is not the same as certifying their correctness. I read a lot of papers, but I typically only look at the underlying data when I intend to repurpose it for something else, and when I do, I tend to notice errors in the ground-truth labels fairly quickly. Of course, most models don't perform well enough for those errors to influence the results much...
My impression from linguistics is that people do go over the papers that use these techniques carefully and raise criticisms of them, but linguists aren't taken seriously, so researchers in other related disciplines ignore the criticisms.
Validation of biological labels easily takes years. In the OP's example, it was a 'lucky' (huge!) coincidence that somebody had already spent years on one of the predicted proteins' labels. Nobody is going to stake 3-5 years of their career on validating some random model's predictions.