Hacker News

Mamba-3

231 points · by matt_d · last Tuesday at 10:45 PM · 44 comments

Comments

nl · today at 6:22 AM

I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.

Havoc · today at 1:27 PM

Is there a reason we don't switch halfway through? i.e., start with a classic LLM and switch to something linear like Mamba as the context grows?

jeffhwang · today at 3:56 PM

I'm glad I clicked through, because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).

https://github.com/mamba-org/mamba

jychang · today at 9:44 AM

I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.

With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers aren't normally leaving GPU cores idle during inference.
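The tradeoff above can be sketched with a back-of-envelope roofline model. All numbers here are hypothetical (roughly H100-class bandwidth/FLOPs and a 7B bf16 model), and the `step_time` model is my own simplification, not anything from the thread or the article:

```python
# Roofline sketch: at what batch size does a decode step flip from
# memory-bandwidth-bound to compute-bound?
# Hypothetical hardware/model numbers; illustrative only.

BW = 3.35e12           # memory bandwidth, bytes/s (assumed)
FLOPS = 990e12         # dense bf16 throughput, FLOP/s (assumed)
P = 7e9                # model parameters (assumed)
WEIGHT_BYTES = 2 * P   # bytes streamed per decode step in bf16
FLOP_PER_TOK = 2 * P   # ~2 FLOPs per parameter per generated token

def step_time(batch_size: int) -> float:
    """Time for one decode step: weights are streamed from VRAM once
    and reused across the whole batch, but compute scales with batch."""
    memory_time = WEIGHT_BYTES / BW
    compute_time = batch_size * FLOP_PER_TOK / FLOPS
    return max(memory_time, compute_time)

# Crossover batch size: where compute time catches up to memory time.
crossover = (WEIGHT_BYTES / BW) * FLOPS / FLOP_PER_TOK
print(f"crossover batch ~ {crossover:.0f}")

# Doubling per-token compute halves that compute-bound batch ceiling,
# which is the "serves fewer users per GPU" effect described above.
crossover_2x = (WEIGHT_BYTES / BW) * FLOPS / (2 * FLOP_PER_TOK)
print(f"with 2x compute per token ~ {crossover_2x:.0f}")
```

Below the crossover the extra compute is "free" in wall-clock terms, but a provider running near the crossover loses batch capacity as soon as per-token FLOPs go up.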


robofanatic · today at 6:09 AM

> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.

Why can't they simply say:

Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.
