Hacker News

radarsat1, today at 9:52 AM

I find it really interesting that it uses a hybrid of Mamba and Transformer layers. Is it the only significant model right now that uses SSM layers, at least partially? That must contribute to lower VRAM requirements, right? And does it affect how KV caching works?