They do use diffusion models, but I don't think they would make a detour via images. They can just generate audio directly with audio diffusion rather than image diffusion.
Technically, there was one early experiment that tricked Stable Diffusion into generating spectrograms, which could then be converted into audio. It worked surprisingly well.
https://web.archive.org/web/20230314190913/https://www.riffu...
https://huggingface.co/riffusion/riffusion-model-v1
But I'd expect everything from the past 3 years to diffuse the audio waveform directly.
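For anyone wondering how the spectrogram route gets back to sound: a generated spectrogram image only holds magnitudes, so the phase has to be estimated before you can invert it. Griffin-Lim is one standard way to do that. The sketch below is not Riffusion's code; it is a NumPy-only illustration, and the FFT size, hop length, window, and all function names are my own assumptions.

```python
import numpy as np

N_FFT = 512                  # FFT size (assumed for illustration)
HOP = 128                    # hop between frames, i.e. 75% overlap
WINDOW = np.hanning(N_FFT)   # analysis and synthesis window
PAD = N_FFT // 2             # reflect padding so the edges are fully covered


def stft(signal):
    """Complex STFT, shape (n_frames, N_FFT // 2 + 1)."""
    padded = np.pad(signal, PAD, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack(
        [padded[i * HOP : i * HOP + N_FFT] * WINDOW for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec):
    """Windowed overlap-add inverse of stft()."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(frames) - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
        norm[i * HOP : i * HOP + N_FFT] += WINDOW ** 2
    out /= np.maximum(norm, 1e-8)   # guard against division by zero
    return out[PAD:-PAD]            # drop the padding added in stft()


def griffin_lim(magnitude, n_iter=50, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase, since the spectrogram carries none.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        audio = istft(magnitude * phase)
        # Keep the target magnitude, adopt the phase of the re-analysed audio.
        phase = np.exp(1j * np.angle(stft(audio)))
    return istft(magnitude * phase)
```

In this setup, `magnitude` would be whatever you decode from the generated image (rows as frames, columns as frequency bins). A real pipeline would also have to undo any mel or log scaling first, which this sketch skips.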