Because something linear like Mamba doesn't perform as well; so you'd have a performance c...

lambda • today at 2:01 PM • 0 replies • view on HN

Because something linear like Mamba doesn't perform as well; so you'd have a performance cliff, where suddenly the model would get more dumb and forget a lot of what was going on.

Instead, you can get benefits from both by doing both in parallel. This can let you reduce the size of the O(n^2) attention mechanism, so while it's still quadratic, it reduces the constant quite a bit while still retaining a lot of performance, as the linear context mechanism can work for the tasks its well suited for while allowing attention to play to its strengths.

The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures this way, with most of their context layers as Mamba while retaining enough attention to continue to be competitive on the more complex tasks that require the quadratic attention.

See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture

alt Hacker News