It doesn't matter if the answer is wrong. You run the test program and then replace the code with the answer it produced. This basically weeds out the UB.
But since it is UB, there's no guarantee that your test program produces the same result as the same code running in production, even if you use the same compiler.
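To make the reproducibility point concrete, here is a minimal sketch in C. It is only an illustration, not the code from the original question: the function names are made up, and signed overflow is just a convenient stand-in for whatever UB is being discussed.

```c
#include <limits.h>

/* Hypothetical example: tries to detect overflow by performing it.
 * For a == INT_MAX, a + 1 is signed overflow, which is UB, so the
 * compiler is allowed to assume a + 1 < a is never true. */
int naive_overflow_check(int a) {
    return a + 1 < a;
}

/* Well-defined alternative: compare against the limit before adding. */
int increment_would_overflow(int a) {
    return a == INT_MAX;
}
```

Calling `naive_overflow_check(INT_MAX)` is the UB case. An unoptimized build may happen to wrap around and return 1, while an optimized build with the same compiler may fold the comparison to 0, so whatever a test run printed tells you nothing reliable. `increment_would_overflow` gives the same answer everywhere.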