As someone who has worked on transformer-based diffusion models before (not for language though), I can say one thing: they're hard.
Denoising diffusion models benefited a lot from the U-Net, which is a pretty simple network (compared to a transformer) and very well adapted to the denoising task. Plus, diffusion on images is great to research because it's very easy to visualize, and therefore to wrap your head around.
Doing diffusion on text is a great idea, but my intuition is that it will prove more challenging, and it will probably take a while before we get something working.