I think you're misunderstanding. The network doesn't match the audio to a ground-truth image and then generate from it; it just takes in audio and predicts an image. The ground-truth images are only used for training the model and for evaluation purposes.
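Roughly the distinction I mean, as a minimal sketch. Nothing here (class names, layers, the MSE loss) comes from the paper; it just shows the asymmetry where the ground-truth image appears only as a training target, never as an input at generation time.

    import torch
    import torch.nn as nn

    class SoundscapeToImage(nn.Module):
        # Hypothetical stand-in for the paper's model: audio in, image out.
        def __init__(self):
            super().__init__()
            self.audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(16000, 256), nn.ReLU())
            self.image_decoder = nn.Sequential(nn.Linear(256, 3 * 64 * 64), nn.Unflatten(1, (3, 64, 64)))

        def forward(self, audio):
            # Only audio is consumed; no ground-truth image is consulted.
            return self.image_decoder(self.audio_encoder(audio))

    model = SoundscapeToImage()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(audio, ground_truth_image):
        # The ground-truth image is used only to compute the loss.
        loss = nn.functional.mse_loss(model(audio), ground_truth_image)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    @torch.no_grad()
    def generate(audio):
        # Inference: audio in, predicted image out, no matching against real images.
        return model(audio)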
The generated images are only vaguely similar to the originals in detail, but the fact that they can estimate the macro structure from audio alone is surprising. I wonder if there's some kind of leakage between the training and test data, e.g. sampling frames from the same videos, because the idea that you could get the time of day right (dusk in a city) just from audio seems improbable.
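For what it's worth, the frame-level leakage worry is checkable: if the train/test split is done by source video rather than by frame, no two frames from the same clip can land on both sides. A quick sketch of what I mean with scikit-learn's GroupShuffleSplit (column names are made up):

    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    # Hypothetical frame-level metadata; video_id marks which clip a frame came from.
    frames = pd.DataFrame({
        "frame_path": ["vid1_f0.jpg", "vid1_f1.jpg", "vid2_f0.jpg", "vid3_f0.jpg"],
        "audio_path": ["vid1_f0.wav", "vid1_f1.wav", "vid2_f0.wav", "vid3_f0.wav"],
        "video_id":   ["vid1",        "vid1",        "vid2",        "vid3"],
    })

    # Group-aware split: every frame from a given video stays on one side.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(frames, groups=frames["video_id"]))
    train, test = frames.iloc[train_idx], frames.iloc[test_idx]

    # No video should appear in both sets.
    assert set(train["video_id"]).isdisjoint(set(test["video_id"]))

Whether the paper actually splits this way I don't know; if frames were sampled randomly across all videos, the dusk-in-a-city result would be much less impressive.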
EDIT: also a minor correction, it's not an LLM, it's a diffusion model. EDIT2: my mistake, there is an LLM too!
It certainly looks like some amount of image matching is going on. Can the model really hear the white/green sign to the left in the first example in figure 3? Can it hear the green sign to the right and red things to the left in the last example?
In response to the correction: the paper says "we propose a Soundscape-to-Image Diffusion model, a generative Artificial Intelligence (AI) model supported by Large Language Models (LLMs)", so presumably there's an LLM involved somewhere?
I’ve heard clips of hot water being poured vs. cold water, and if you heard the examples, you would probably guess which is which too.
Time of day seems almost easy. Are there animal noises? Those won’t sound the same all day. And traffic too. Even things like the sound of wind may generally be different in the morning vs. at night.
This is not to suggest the researchers aren't leaking data or that the examples weren't cherry-picked; it seems probable they are doing one or the other. But it is to say that if a model were trained on a particular intersection and then heard a sample from it, it could probably predict the time of day reasonably well.
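Something like the sketch below is all I have in mind, assuming you had clips from one intersection labeled with coarse time-of-day buckets. The paths, labels, and mean log-mel features are placeholder choices, not anything from the paper.

    import numpy as np
    import librosa
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def clip_features(path, sr=22050):
        # Collapse each clip to a fixed-length vector of mean log-mel energies.
        audio, _ = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
        return librosa.power_to_db(mel).mean(axis=1)

    # Placeholder clips from a single intersection, labeled by time-of-day bucket.
    clips = [
        ("intersection/morning_01.wav", "morning"),
        ("intersection/dusk_01.wav", "dusk"),
        ("intersection/night_01.wav", "night"),
        # ... enough clips per bucket for cross-validation to mean something
    ]

    X = np.stack([clip_features(path) for path, _ in clips])
    y = np.array([label for _, label in clips])

    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, X, y, cv=5).mean())  # rough sanity check, not a benchmark

Averaging away the temporal structure is crude, but for a single fixed location it should still pick up gross cues like traffic density and birdsong, which is all the time-of-day claim really needs.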