The concept you need here is "Statistical Power".
The ELI5 version is that there are two mistakes you can make when looking at a p-value:
Type I error (a false positive): your p-value is falsely low. In the experiment being discussed here, that would lead one to conclude that AI code is worse when it isn't.
Type II error (a false negative): your p-value is falsely high, leading you to conclude that AI code is no different when it actually is.
https://en.wikipedia.org/wiki/Power_(statistics)
Power is the probability of avoiding a Type II error, i.e. of detecting a real effect of a given size. One can calculate it for a given experimental protocol.
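For a rough idea of what that calculation looks like, here is a minimal sketch using the normal approximation for a two-sided, two-sample comparison of means. The function name `approx_power` and the effect size and sample sizes in the loop are made up for illustration; they are not the numbers from the experiment being discussed.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized difference in means (Cohen's d).
    This uses the normal approximation, so it is slightly optimistic
    compared to an exact t-test calculation at small sample sizes.
    """
    std_normal = NormalDist()
    # Critical value for rejecting the null at the chosen alpha.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How far the test statistic is shifted if the effect is real.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical numbers: a smallish effect (d = 0.3) at several sample sizes.
for n in (10, 50, 200):
    print(f"n per group = {n:>3}: power ~ {approx_power(0.3, n):.2f}")
```

The conventional target is power of about 0.8. When the computed power is far below that, a non-significant result says very little about whether a real effect exists.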
My hunch is that if you did this, you would find this experiment is grossly under-powered.
This means you can't make the "absence of evidence" claim.
He can't make the evidence of absence claim, but he can absolutely make the absence of evidence claim.