How do you determine if something is publish-worthy? If someone puts a lot of effort into an experiment that fails, it can still seem publish-worthy, so that others can learn what works and what doesn't. I think it should be more about the level of effort. Otherwise the incentives become all wrong too.
In this case, the algorithm can determine broad classes like "rural" or "city", but aside from those classes the generated images have little connection with the audio. I think most DL researchers would agree that this is low-effort stuff, and therefore not publish-worthy. In addition, the word "accurate" in the title is misleading.