Hacker News

Havoc today at 1:27 PM

Is there a reason we don't switch halfway through? i.e. start with a classic LLM and switch to something linear like Mamba as context grows


Replies

lambda today at 2:01 PM

Because something linear like Mamba doesn't perform as well, so you'd hit a performance cliff: the model would suddenly get dumber and forget a lot of what was going on.

Instead, you can get benefits from both by running both in parallel. This lets you shrink the O(n^2) attention mechanism, so while it's still quadratic, the constant drops quite a bit while retaining a lot of performance: the linear context mechanism handles the tasks it's well suited for, while attention plays to its strengths.

The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures in this way, with most of their context layers being Mamba while retaining enough attention layers to stay competitive on the more complex tasks that require quadratic attention.

See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture
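To make the "reduces the constant but stays quadratic" point concrete, here's a toy cost model (layer counts and context length are made-up illustrative numbers, not Nemotron's actual configuration) comparing a pure-attention stack against a hybrid stack where most layers are linear-time:

```python
# Toy cost model (hypothetical numbers): total per-forward-pass cost of a
# pure-transformer stack vs a hybrid stack where most layers are linear-time
# (Mamba-style) and only a few keep full quadratic attention.

def stack_cost(n_ctx, n_attn_layers, n_linear_layers):
    """Rough cost: attention layers scale as n^2 in context length,
    linear-state layers scale as n (all constants ignored)."""
    return n_attn_layers * n_ctx**2 + n_linear_layers * n_ctx

n_ctx = 32_000
pure = stack_cost(n_ctx, n_attn_layers=32, n_linear_layers=0)
hybrid = stack_cost(n_ctx, n_attn_layers=4, n_linear_layers=28)

print(f"pure / hybrid cost ratio at {n_ctx} tokens: {pure / hybrid:.1f}x")
```

With 4 of 32 layers kept as attention, the hybrid is roughly 8x cheaper here, but its cost curve still bends quadratically as n_ctx grows; only the constant changed.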

0xbadcafebee today at 5:03 PM

They did do that, 2 years ago. The problems are that 1) Mamba's accuracy gets worse as context size grows, 2) NVIDIA GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications, but it doesn't beat regular transformers if you have the gear.

energy123 today at 1:47 PM

Probably best achieved by model routing: either as an indirection behind the chat UI, or the API user does it themselves by calling a different model for long-context queries.
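A minimal sketch of that routing idea (model names and the 8k threshold are hypothetical placeholders, not real endpoints):

```python
# Length-based model routing: short requests go to a full-attention model,
# long-context requests go to a hybrid/linear model. All names are made up.

ROUTES = [
    (8_192, "full-attention-model"),       # up to 8k prompt tokens
    (float("inf"), "hybrid-mamba-model"),  # everything longer
]

def route(prompt_tokens: int) -> str:
    """Pick the first route whose context limit fits the prompt."""
    for max_ctx, model in ROUTES:
        if prompt_tokens <= max_ctx:
            return model
    raise ValueError("no route matched")

print(route(2_000))    # short query -> full-attention-model
print(route(100_000))  # long query  -> hybrid-mamba-model
```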

mountainriver today at 2:01 PM

We kinda do do this with hybrid Mamba transformers.

cubefox today at 1:52 PM

Linear-time-complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead, the usual recommendation is to mix linear and attention layers. Presumably this mostly solves the performance problem (at least on benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM-hungry at long context than a pure transformer, though.
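The "less RAM-hungry" part is easy to see with a back-of-envelope KV-cache calculation (all model dimensions here are assumed for illustration): attention layers need a key/value cache that grows with context, while linear-state layers keep a fixed-size state no matter how long the context gets.

```python
# Back-of-envelope cache memory (hypothetical dims, fp16 = 2 bytes/value):
# attention layers store K and V vectors for every token seen; linear-state
# layers keep a fixed-size recurrent state independent of context length.

BYTES = 2                              # fp16
D_MODEL = 4096                         # hidden size (assumed)
STATE_PER_LINEAR = 4096 * 16 * BYTES   # fixed state per linear layer (assumed)

def cache_bytes(n_ctx, n_attn, n_linear):
    kv = n_attn * n_ctx * 2 * D_MODEL * BYTES  # K and V, per token, per layer
    state = n_linear * STATE_PER_LINEAR        # does not grow with n_ctx
    return kv + state

n_ctx = 128_000
pure = cache_bytes(n_ctx, n_attn=32, n_linear=0)
hybrid = cache_bytes(n_ctx, n_attn=4, n_linear=28)
print(f"pure: {pure/1e9:.1f} GB, hybrid: {hybrid/1e9:.1f} GB")
```

At these made-up dimensions the pure-transformer cache is several times larger than the hybrid's, and the gap widens linearly as context grows, since the linear layers' state term never moves.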