Hacker News

Havoc today at 1:27 PM

Is there a reason we don't switch halfway through? i.e. start with a classic LLM and switch to something linear like Mamba as context grows


Replies

lambda today at 2:01 PM

Because something linear like Mamba doesn't perform as well, so you'd hit a performance cliff: the model would suddenly get dumber and forget a lot of what was going on.

Instead, you can get benefits from both by running both in parallel. This lets you shrink the O(n^2) attention mechanism, so while it's still quadratic, the constant drops quite a bit while retaining a lot of performance: the linear context mechanism handles the tasks it's well suited for, while attention plays to its strengths.

The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures in this way, with most of their context layers being Mamba while retaining enough attention layers to stay competitive on the more complex tasks that require quadratic attention.

See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture
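To make the "reduces the constant but stays quadratic" point concrete, here's a toy cost model (layer counts and context length are made-up illustrative numbers, not Nemotron's actual configuration) comparing a pure-attention stack against a hybrid stack where most layers are linear-time:

```python
# Toy cost model (hypothetical numbers): total per-forward-pass cost of a
# pure-transformer stack vs a hybrid stack where most layers are linear-time
# (Mamba-style) and only a few keep full quadratic attention.

def stack_cost(n_ctx, n_attn_layers, n_linear_layers):
    """Rough cost: attention layers scale as n^2 in context length,
    linear-state layers scale as n (all constants ignored)."""
    return n_attn_layers * n_ctx**2 + n_linear_layers * n_ctx

n_ctx = 32_000
pure = stack_cost(n_ctx, n_attn_layers=32, n_linear_layers=0)
hybrid = stack_cost(n_ctx, n_attn_layers=4, n_linear_layers=28)

print(f"pure / hybrid cost ratio at {n_ctx} tokens: {pure / hybrid:.1f}x")
```

With 4 of 32 layers kept as attention, the hybrid is roughly 8x cheaper here, but its cost curve still bends quadratically as n_ctx grows; only the constant changed.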

0xbadcafebee today at 5:03 PM

They did do that, 2 years ago. The problems are that 1) Mamba's accuracy gets worse as context size grows, 2) NVIDIA GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications, but it doesn't beat regular transformers if you have the gear.

energy123 today at 1:47 PM

Probably best achieved by model routing: either as an indirection behind the chat UI, or the API user does it themselves by calling a different model for long-context queries.
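A minimal sketch of that routing idea (model names and the 8k threshold are hypothetical placeholders, not real endpoints):

```python
# Length-based model routing: short requests go to a full-attention model,
# long-context requests go to a hybrid/linear model. All names are made up.

ROUTES = [
    (8_192, "full-attention-model"),       # up to 8k prompt tokens
    (float("inf"), "hybrid-mamba-model"),  # everything longer
]

def route(prompt_tokens: int) -> str:
    """Pick the first route whose context limit fits the prompt."""
    for max_ctx, model in ROUTES:
        if prompt_tokens <= max_ctx:
            return model
    raise ValueError("no route matched")

print(route(2_000))    # short query -> full-attention-model
print(route(100_000))  # long query  -> hybrid-mamba-model
```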

mountainriver today at 2:01 PM

We kinda do do this with hybrid Mamba transformers.

cubefox today at 1:52 PM

Linear-time-complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead, the usual recommendation is to mix linear and attention layers. Presumably this mostly solves the performance problem (at least on benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM-hungry at long context than a pure transformer, though.
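The "less RAM-hungry" part is easy to see with a back-of-envelope KV-cache calculation (all model dimensions here are assumed for illustration): attention layers need a key/value cache that grows with context, while linear-state layers keep a fixed-size state no matter how long the context gets.

```python
# Back-of-envelope cache memory (hypothetical dims, fp16 = 2 bytes/value):
# attention layers store K and V vectors for every token seen; linear-state
# layers keep a fixed-size recurrent state independent of context length.

BYTES = 2                              # fp16
D_MODEL = 4096                         # hidden size (assumed)
STATE_PER_LINEAR = 4096 * 16 * BYTES   # fixed state per linear layer (assumed)

def cache_bytes(n_ctx, n_attn, n_linear):
    kv = n_attn * n_ctx * 2 * D_MODEL * BYTES  # K and V, per token, per layer
    state = n_linear * STATE_PER_LINEAR        # does not grow with n_ctx
    return kv + state

n_ctx = 128_000
pure = cache_bytes(n_ctx, n_attn=32, n_linear=0)
hybrid = cache_bytes(n_ctx, n_attn=4, n_linear=28)
print(f"pure: {pure/1e9:.1f} GB, hybrid: {hybrid/1e9:.1f} GB")
```

At these made-up dimensions the pure-transformer cache is several times larger than the hybrid's, and the gap widens linearly as context grows, since the linear layers' state term never moves.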