That would be explained by data leakage too, e.g. sampling frames for the train and test sets from the same video sequences. There's nothing in the writeup that suggests the model is explicitly matching audio to ground truth images.
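To make the leakage mechanism concrete, here is a minimal sketch (not from the paper, all names illustrative) of the difference between a frame-level split and a video-level split using scikit-learn's GroupShuffleSplit. If frames are sampled independently, near-identical frames from the same sequence land on both sides; grouping by video ID prevents that.

```python
# Hypothetical sketch: split by video sequence rather than by frame to avoid
# train/test leakage. `frames` and `video_ids` are made-up placeholders.
from sklearn.model_selection import GroupShuffleSplit

frames = ["vid1_f001", "vid1_f002", "vid2_f001", "vid2_f002", "vid3_f001"]
video_ids = ["vid1", "vid1", "vid2", "vid2", "vid3"]

# Keep every frame of a given video on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(frames, groups=video_ids))

train_frames = [frames[i] for i in train_idx]
test_frames = [frames[i] for i in test_idx]
print(train_frames, test_frames)  # no video ID appears in both lists
```

A purely random frame-level split would instead let the model effectively memorise a scene at train time and be "tested" on an almost identical frame of the same scene.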
The researchers' suggestion that certain architectural features might have been encoded in the sound [which is at least superficially plausible] is rather undermined by the data leakage also leading the model to generate the right colour signage in the right part of multiple images. The fidelity of the sound clearly isn't enough for the model to register key aspects of the sign's geometry, like it only being a few feet from the observer, yet it has somehow managed to pick up that it's green and x pixels from the left of the image...