Some of these are not very accurate. That "country side" image has entirely the wrong foliage color (fall colors vs. spring colors). It also appears to place buildings even though the "ground truth" image was taken by a small stream.
I would not rely on this tool for any meaningful data collection.
You really cannot expect audio processing to yield color information.
Beyond that, you are correct that the 3D shapes themselves cannot be derived with perfect accuracy (see my other post).