I wish there were more of this research to speed things up, rather than building ever-larger models.
Is anyone doing any form of diffusion language models that are actually practical to run today on the actual machine under my desk? There are loads of more "traditional" .gguf options (well, quants) that are practical even on shockingly weak hardware, and I've been seeing things that give me hope that diffusion is the next step forward, but so far it's all been early research prototypes.
I'd love to know what's going on with the Gemini Diffusion model: they had a preview last May and it was crazy fast, but I've heard nothing since.
Seeing half of an AR LLM's output tokens go to generating a predefined JSON schema bothers me so much. I would love to have the option to use diffusion for infilling.
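To make that concrete, here's a toy sketch of schema-pinned infilling (everything here is hypothetical: `fake_denoise` is a random stand-in for a real diffusion LM, not any actual API). The structural tokens are frozen, so denoising steps are only spent on the value slots:

```python
import random

MASK = "<MASK>"

def fake_denoise(tokens, pos):
    # Stand-in for one denoising step at a masked position;
    # a real diffusion LM would predict this from context.
    return random.choice(['"Alice"', '"Bob"', '41'])

def infill(template, denoise, steps=4):
    tokens = list(template)
    frozen = [t != MASK for t in tokens]   # schema tokens are pinned, never resampled
    for _ in range(steps):                 # iterative parallel refinement
        tokens = [t if keep else denoise(tokens, i)
                  for i, (t, keep) in enumerate(zip(tokens, frozen))]
    return tokens

# The braces, keys, and punctuation cost nothing: only the two
# value slots are ever sampled.
template = ['{', '"name"', ':', MASK, ',', '"age"', ':', MASK, '}']
print(" ".join(infill(template, fake_denoise)))
```

An AR model would pay for every brace and key in that template; here they're free by construction.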
Releasing this on the same day as Taalas's 16,000-token-per-second acceleration of the roughly comparable Llama 8B model must hurt!
I wonder how far down a diffusion LM can be scaled. I've been playing with in-browser models, and the speed is painful.
A lot of this post-training recipe feels reminiscent of DINO training (teacher/student, use of stop gradients). I wonder if the more recent LeJEPA SIGReg regularization research might be relevant here for simpler post-training.
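For anyone who hasn't seen it, the pattern I'm alluding to is roughly this minimal PyTorch sketch of the teacher/student + stop-gradient + EMA idea (simplified to an MSE objective; this is not the actual recipe from the post or from DINO):

```python
import torch

student = torch.nn.Linear(16, 16)
teacher = torch.nn.Linear(16, 16)
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)                 # teacher is never trained directly

opt = torch.optim.SGD(student.parameters(), lr=1e-2)

x = torch.randn(8, 16)
x_student = x + 0.1 * torch.randn_like(x)   # two "views" of the same input
target = teacher(x).detach()                # stop-gradient on the teacher branch

loss = (student(x_student) - target).pow(2).mean()
loss.backward()
opt.step()

with torch.no_grad():                       # EMA: teacher slowly tracks the student
    m = 0.99
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```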
I do wonder why diffusion models aren't used alongside constrained decoding for programming; surely it makes more sense than using an auto-regressive model.
Google is working on a similar line of research. I wonder why they haven't rolled out a GPT-4o-scale version of this yet.
I think diffusion makes much more sense than auto-regression (AR) for code generation specifically, compared to chatbots.
Is this available as open source anywhere to try?
This doesn't mention the main drawback of diffusion language models, and the reason nobody is using them: they score significantly lower on benchmarks than autoregressive models of similar size.
Can't wait for the day I can actually try a diffusion model on my own machine (128GB M4 Max) rather than as a hosted service. So far I haven't seen a single piece of software that supports it.
If this means there’s a 2x-7x speedup available to a scaled diffusion model like Inception Mercury, that’ll be a game changer. It feels 10x faster already…
Diffusion model papers are always interesting to read, but I always feel like they need some mechanism to insert or delete tokens. In the figure in this post, once the model has fixed "British munchkin cats _ _ and ...", you _can't_ get to "British munchkin cats are a new and controversial breed." because there isn't the right number of tokens between "cats" and "and". In a coding context, if your model samples a paren or a comma or something that's entirely plausible at that position, it can still close off an expansion that would have been syntactically correct.
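The fixed-slot problem is easy to see in a few lines. This toy check assumes a hypothetical whitespace tokenization, and I've filled in the figure's "..." with the tail of the target sentence just to make the example concrete:

```python
MASK = "_"

def reachable(committed, target, mask=MASK):
    # A fixed-length denoiser can only rewrite mask slots in place:
    # committed tokens stay put and there are no insert/delete
    # operations, so the target must match slot-for-slot.
    if len(committed) != len(target):
        return False
    return all(c in (mask, t) for c, t in zip(committed, target))

committed = "British munchkin cats _ _ and controversial breed .".split()
target = "British munchkin cats are a new and controversial breed .".split()
print(reachable(committed, target))  # False: "are a new" is 3 tokens for 2 slots
```

Any insert/delete mechanism (or variable-width slots) would make the target reachable again.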