My understanding is that music generation works more like Stable Diffusion: it generates the waveform as an image, then turns that into an audio file.
They do use diffusion models, but I don't think they would make a detour via images. They can just generate audio directly with audio diffusion rather than image diffusion.
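To illustrate why no image detour is needed, here is a toy sketch of the noising step that diffusion models are trained to reverse, applied straight to a 1-D waveform. This is the standard DDPM-style forward step, not the code of any particular music generator; the function name `forward_diffuse`, the 440 Hz test tone, and the noise level 0.5 are all made up for the example.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, rng):
    """Mix a clean signal with Gaussian noise (DDPM-style forward step).

    alpha_bar_t is the fraction of signal variance kept: 1.0 means no noise,
    0.0 means pure noise. Works for any array shape, 1-D audio included.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)

# Hypothetical input: one second of a 440 Hz tone at 16 kHz, shape (16000,)
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
waveform = np.sin(2 * np.pi * 440.0 * t)

# Noise the raw samples directly; no 2-D image is involved at any point
noisy = forward_diffuse(waveform, alpha_bar_t=0.5, rng=rng)
```

Nothing in the math cares whether the array is a 2-D image or a 1-D list of audio samples, so a model can be trained to denoise audio-shaped data directly.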