As mentioned, I've just finished the implementation and started playing around with it, and it seems to do about as well inside my own agent harness as similarly sized "traditional" LLMs. Of course, neither comes close to SOTA models, but I suppose if we can figure out the scaling issues you mention, we'd get a bit closer. The performance just feels too good to ditch diffusion this quickly. Do you have more info on what those "can't be trained beyond low/mid size" issues are in practice today?