gyrovagueGeist • today at 9:50 AM

Not wrong, but I think it's more accurate to say:

Mamba is an architecture for the middle layers of the network (the trunk), and it assumes decoding happens autoregressively, emitting tokens one at a time in order. This is the SSM (state space model) they talk about.

Diffusion is an alternative to the autoregressive approach: decoding happens through iterative refinement of a whole batch of tokens, instead of processing tokens one at a time and locking each one in while only looking forward. This can require different architectures for the trunk and the output heads, plus modifications to the objective to make the whole thing trainable. Could Mamba-like ideas be useful in diffusion networks? Maybe, but it's a different problem setup.
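To make the contrast concrete, here is a toy Python sketch of the two decoding loops. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in that already "knows" the answer, and the iterative loop is a simplified unmasking scheme (commit the most confident masked positions each step), not the sampler of any particular diffusion model.

```python
import random

MASK = "_"
TARGET = list("hello world")  # toy stand-in for what a trained model would predict


def toy_model(tokens):
    """Hypothetical network: returns a (prediction, confidence) pair per position."""
    rng = random.Random(0)  # fixed seed so confidences are reproducible
    return [(TARGET[i], rng.random()) for i in range(len(tokens))]


def decode_autoregressive(length):
    """Emit one token per step, left to right; earlier tokens are never revisited."""
    tokens = []
    for i in range(length):
        pred, _ = toy_model(tokens + [MASK])[i]
        tokens.append(pred)
    return tokens


def decode_iterative(length, steps=4):
    """Start fully masked and fill in several positions per step, in any order."""
    tokens = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    for _ in range(steps):
        preds = toy_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the masked positions the model is most confident about.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens


ar_text = "".join(decode_autoregressive(len(TARGET)))
it_text = "".join(decode_iterative(len(TARGET)))
```

The autoregressive loop needs one model call per token, while the iterative loop needs one call per refinement step regardless of position order. Real diffusion samplers may also revise positions they already filled, which this sketch leaves out.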

Replies

joefourier • today at 2:26 PM

Mamba doesn't assume autoregressive decoding, and you can absolutely use it for diffusion, or pretty much any other common objective. The same goes for a conventional transformer. For a discrete diffusion language model, the output head is essentially the same as an autoregressive one. But yes, the training/objective/inference setup is different.
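A rough Python sketch of that last point, under stated assumptions: `MASK_ID` and both helper functions are hypothetical names, and the masking scheme is a simplified masked-diffusion-style objective. In both cases the targets are ordinary token ids, so the same per-position vocabulary head can score them; only the way inputs and targets are built differs.

```python
import random

MASK_ID = 0  # hypothetical reserved id for the mask token


def autoregressive_pair(seq):
    """Next-token objective: position i is trained to predict token i + 1."""
    return seq[:-1], seq[1:]


def masked_diffusion_pair(seq, mask_ratio, rng):
    """Masked objective: hide random positions and predict the originals there."""
    n_mask = max(1, round(mask_ratio * len(seq)))
    masked = set(rng.sample(range(len(seq)), n_mask))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(seq)]
    targets = {i: seq[i] for i in sorted(masked)}  # position -> original token id
    return inputs, targets


seq = [5, 9, 3, 7, 2, 8]
ar_in, ar_tgt = autoregressive_pair(seq)
df_in, df_tgt = masked_diffusion_pair(seq, 0.5, random.Random(0))
```

The trunk that maps inputs to per-position features could be a transformer, Mamba, or something else in either setup, provided it is allowed to see the context the objective requires.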

cubefox • today at 1:41 PM

Linear architectures are at least heavily used in image diffusion models. More so in fact than in language models.
