Given the perverse incentives and the long history of model releases over-claiming accuracy, it's very hard to believe anything until it's open source and can be tested independently.
That being said, I do very much believe that the computational efficiency of models is going to go up [correction] drastically over the coming months, which raises interesting questions about Nvidia's throne.
*previously miswrote and said computational efficiency would go down
If the claims in the abstract are true, then this is legitimately revolutionary. I don't believe it; there are probably some major constraints or caveats that keep these results from generalizing. I'll read through the paper carefully this time instead of just skimming it, and come back with thoughts once I've digested it.
This is great! But what if the US invests 1% of GDP in GPU datacenters and then those aren't needed because someone created a much more efficient architecture?
It would be REALLY cool to see this same distillation technique applied to a much more recent OSS model. For example, Mistral 3 14B would be a great target: how efficient could we get inference there?
> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.
This is an extraordinary claim. Is there a catch I'm missing, or am I misreading?
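To get a feel for what the quoted percentages would mean in practice, here's a rough back-of-envelope sketch. The layer/head counts are my assumptions based on typical Llama 3.x configs (not numbers from the paper); only the KV-cache fractions come from the quoted abstract.

```python
# Back-of-envelope: what the quoted KV-cache reductions imply in memory terms.
# Architecture configs below are assumptions, not taken from the paper.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes for keys + values across all layers at a given context length (fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# (n_layers, n_kv_heads, head_dim) -- assumed configs for the three variants
configs = {
    "1B": (16, 8, 64),
    "3B": (28, 8, 128),
    "8B": (32, 8, 128),
}
# KV-cache fractions quoted in the abstract
fractions = {"1B": 0.039, "3B": 0.02, "8B": 0.0273}

seq_len = 8192
for name, (layers, kv_heads, head_dim) in configs.items():
    full = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)
    reduced = full * fractions[name]
    print(f"{name}: {full / 2**20:.0f} MiB -> {reduced / 2**20:.1f} MiB per 8K-token sequence")
```

Under those assumptions, the 8B variant's cache would drop from roughly 1 GiB to under 30 MiB per 8K-token sequence, which is why the claim sounds so extraordinary.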
This is from May 2025, according to the arXiv watermark. Maybe that should be mentioned in the title.