Hacker News

famouswaffles, yesterday at 7:18 PM

Difficulty of scaling is not the only issue. Nobody is going to be particularly invested in scaling an architecture that has:

- consistently lagged behind its auto-regressive counterparts in quality. Look at the dgemma benchmarks: pretty steep dropoffs, and the more difficult the benchmark, the worse the dropoff. That's not a good look, and it's not like it's some artifact of Google's release. Every dLLM is like this.

- And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.

>"DiffusionGemma's speedup is designed for local and low-concurrency inference. In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma's parallel decoding offers diminishing returns and can result in higher serving costs"
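To make the quoted trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a toy roofline model where one forward pass costs whichever is larger: reading the weights once, or doing the math for every token position in the batch. All hardware and model numbers are hypothetical, and `block` and `committed` (positions computed per pass vs. tokens actually kept per pass) are illustrative names, not DiffusionGemma parameters. It ignores KV cache traffic, attention cost, and scheduling, so treat it as intuition only.

```python
# Toy roofline model; every constant below is a made-up illustrative value.
WEIGHT_BYTES = 8e9        # bytes of weights read per forward pass (hypothetical)
FLOPS_PER_POSITION = 8e9  # FLOPs per token position (hypothetical)
MEM_BANDWIDTH = 1e12      # bytes/s (hypothetical accelerator)
PEAK_FLOPS = 1e14         # FLOPs/s (hypothetical accelerator)


def step_time(positions):
    """Seconds for one forward pass over `positions` token positions."""
    memory_time = WEIGHT_BYTES / MEM_BANDWIDTH  # weights are read once per pass
    compute_time = positions * FLOPS_PER_POSITION / PEAK_FLOPS
    return max(memory_time, compute_time)


def throughput(users, block, committed):
    """Tokens/s when each user computes `block` positions per pass
    and keeps `committed` of them (committed <= block)."""
    return users * committed / step_time(users * block)


for users in (1, 512):
    ar = throughput(users, block=1, committed=1)         # auto-regressive
    parallel = throughput(users, block=32, committed=8)  # parallel decoding
    print(f"{users:>4} users: AR {ar:.0f} tok/s, parallel {parallel:.0f} tok/s")
```

Under these assumptions the parallel decoder wins at one user, because the pass is memory-bound and the extra positions are nearly free. At 512 users the pass is compute-bound, so every computed-but-discarded position is paid for in full and the auto-regressive setup comes out ahead.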

Put yourself in the shoes of all the labs, even open source ones. Why would you put much effort into this?


Replies

embedding-shape, yesterday at 7:40 PM

> - And whose inference benefits are negated at scale. Transformers are still cheaper if you want to serve lots of users.

But my entire point is about the reverse of this. The context of what I bring up is single-user scenarios, which is where these diffusion models really make a large difference in performance.

Sure, I agree it's not a good fit for every single use case out there, everywhere. But after starting to play around with it closer myself, I think people are dismissing it a bit too quickly, at least if you're interested in running local models on your own hardware.

zozbot234, yesterday at 7:37 PM

Single-user scenarios can also use MTP (multi-token prediction) to make auto-regressive inference more compute-intensive with no loss of quality.
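A minimal sketch of the arithmetic behind the MTP point, assuming MTP here means extra prediction heads that draft several tokens which the main model then verifies in a single pass, as in speculative decoding. It also assumes each drafted token is accepted independently with the same probability, which is a simplification. The function name and parameters are hypothetical. Quality is preserved because a drafted token is only kept when the main model agrees with it.

```python
def expected_tokens_per_pass(accept_prob, draft_len):
    """Expected tokens produced per verification pass.

    The pass always yields at least one token from the main model;
    drafted token i is kept only if all earlier drafts were accepted,
    which happens with probability accept_prob ** i.
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))


# Hypothetical acceptance rates with 4 drafted tokens per pass.
for p in (0.0, 0.5, 0.8, 1.0):
    print(f"accept_prob={p}: {expected_tokens_per_pass(p, 4):.2f} tokens/pass")
```

In the memory-bound single-user regime, verifying a handful of extra positions costs roughly the same as one ordinary decode step, so tokens per pass translates fairly directly into speedup.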