This is awful but hardly surprising. Someone mentioned reproducible code accompanying the papers, but there is a high likelihood of the code being partially or fully AI generated as well: an AI-generated hypothesis -> AI-produced code to implement and test the hypothesis -> an AI-generated paper based on the hypothesis and the code.
Also: there were 15,000 submissions rejected at NeurIPS; it would be very interesting to see what percentage of those rejected were partially or fully AI generated/hallucinated. Are the ratios comparable?
Whether the code is AI generated is not what matters; what matters is whether it actually works.
Sharing code enables others to validate the method on a different dataset.
Even before LLMs came around, there were plenty of methods that looked good on paper but turned out not to work outside of the accepted benchmarks.