Hacker News

Zebra-Llama: Towards Efficient Hybrid Models

59 points by mirrir today at 8:15 PM | 29 comments

Comments

AlexCoventry today at 11:39 PM

This is from May 2025, according to the arxiv watermark. Maybe that should be mentioned in the title.

adityashankar today at 9:45 PM

Due to perverse incentives and the long history of models over-claiming accuracy, it's very hard to believe anything until it is open source and can be tested out.

That being said, I do very much believe that the computational efficiency of models is going to go up* drastically over the coming months, which does pose interesting questions about Nvidia's throne.

*previously miswrote and said computational efficiency will go down

a_wild_dandan today at 10:45 PM

If the claims in the abstract are true, then this is legitimately revolutionary. I don’t believe it. There are probably some major constraints/caveats that keep these results from generalizing. I’ll read through the paper carefully this time instead of just skimming it, and come back with thoughts after I’ve digested it.

xer today at 10:47 PM

This is great! But what if the US invests 1% of GDP in GPU datacenters and then those turn out not to be needed because someone creates a much more efficient architecture?

Reubend today at 10:35 PM

It would be REALLY cool to see this same technique applied to distill a much more recent OSS model. For example, Mistral 3 14B would be a great target. How efficient can we get inference there?

mason_mpls today at 9:44 PM

> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.

This is an extraordinary claim. Is there a catch I’m missing? Am I misreading?
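
For a rough sense of where numbers like that could come from, here's a back-of-the-envelope sketch (my own arithmetic, not the paper's configuration): the KV cache grows with the number of attention layers, so if most layers are swapped for SSM blocks with constant-size state, the per-token cache shrinks roughly in proportion to the attention layers that remain. The layer split below is hypothetical; the Llama-3.2-1B head/dimension numbers are the publicly documented ones.

    # Back-of-the-envelope KV-cache arithmetic (sketch; the hybrid's layer split
    # is an assumption, not the paper's actual configuration).
    def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
        """KV-cache bytes for the attention layers only: 2 tensors (K and V) per layer."""
        return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

    # Llama-3.2-1B-style config: 16 layers, 8 KV heads, head_dim 64, fp16 cache.
    seq_len = 8192
    full = kv_cache_bytes(16, 8, 64, seq_len)    # all 16 layers use standard attention
    hybrid = kv_cache_bytes(2, 8, 64, seq_len)   # hypothetical: only 2 attention layers kept

    print(f"full:   {full / 2**20:.0f} MiB")
    print(f"hybrid: {hybrid / 2**20:.0f} MiB ({hybrid / full:.1%} of full)")
    # -> 256 MiB vs 32 MiB, i.e. 12.5%. Reaching the ~2-4% the abstract reports would
    #    also need the remaining attention layers to be compressed (fewer KV heads or
    #    a smaller per-token latent), not just fewer of them.

So the headline ratios are at least dimensionally plausible; the surprising part is preserving ~100% of zero-shot accuracy while doing it.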
