In this case the algorithm can determine broad classes like "rural" or "city", but beyond those classes the generated images have little connection with the audio. I think most DL researchers would agree that this is low-effort work, and therefore not worth publishing. In addition, the word "accurate" in the title is misleading.