How do you know? I can believe that they didn't show memory errors in a quick test run on a common architecture with a common compiler, much like most human-written code in the training corpus.
It wasn't code worth formally verifying, but even what you describe beats almost any programmer's first pass. Given how good it is at finding bugs when you ask it to, I have little reason to doubt its output.