This is more like a classifier. They have a bunch of human-classified image/sound pairs, and they match unclassified sounds to the classified sounds. Then there's a Midjourney image generation step, but that's probably unnecessary.
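To make the matching step concrete, here's a rough sketch of what "match unclassified sounds to the classified sounds" could look like as a nearest-neighbor lookup. Everything in it is an assumption for illustration: the feature vectors, the image filenames, the `match_sound` helper, and the choice of cosine similarity are all made up, and the actual project may well do this differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def match_sound(query, labeled_pairs):
    """Return (image, score) for the labeled sound most similar to `query`.

    `labeled_pairs` is a list of (sound_features, image) tuples standing in
    for the human-classified image/sound pairs.
    """
    best_image, best_score = None, -math.inf
    for features, image in labeled_pairs:
        score = cosine_similarity(query, features)
        if score > best_score:
            best_image, best_score = image, score
    return best_image, best_score


# Hypothetical labeled set: made-up sound features paired with made-up images.
labeled_pairs = [
    ([0.9, 0.1, 0.0], "image_a.png"),
    ([0.1, 0.8, 0.2], "image_b.png"),
    ([0.0, 0.2, 0.9], "image_c.png"),
]

# An unclassified sound, as an equally hypothetical feature vector.
image, score = match_sound([0.8, 0.2, 0.1], labeled_pairs)
print(image, round(score, 3))
```

In this sketch the lookup already returns an image from the labeled set, with no generation step involved.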