Hacker News

credit_guy · yesterday at 9:54 PM | 4 replies

Like this?

https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT


Replies

jychang · yesterday at 11:12 PM

Or like this: https://api-docs.deepseek.com/news/news251201

I don't know what's so special about this paper.

- They claim to use MLA to cut the KV cache by ~90%. Deepseek introduced that with Deepseek V2 (and kept it in V3, Deepseek R1, etc.); see the rough cache arithmetic after this list.

- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.

- They claim to save a lot of money by not doing a full multi-million-dollar pre-train run. Well, so did Deepseek V3.2. Deepseek hasn't done a full $5.6M pretraining run since Deepseek V3 in 2024: Deepseek R1 is just a ~$294k post-train on top of the expensive V3 pretrain, and Deepseek V3.2 is just a hybrid linear attention post-train. I don't know its exact price, but it's probably a few hundred thousand dollars as well.
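On the MLA point, here is a minimal back-of-the-envelope sketch of why caching one shared latent instead of full per-head K/V shrinks the cache so much. The dimensions are my own illustrative assumptions (roughly Deepseek-V2-like), not numbers from the Zebra-Llama paper:

    # KV-cache arithmetic: standard multi-head attention (MHA) vs.
    # multi-head latent attention (MLA). All dimensions are assumed,
    # roughly DeepSeek-V2-like, purely for illustration.

    n_heads  = 128   # attention heads (assumed)
    d_head   = 128   # per-head dimension (assumed)
    d_latent = 512   # MLA compressed KV latent width (assumed)
    d_rope   = 64    # decoupled RoPE key dim cached alongside the latent (assumed)

    mha_per_token = 2 * n_heads * d_head   # full K and V vectors for every head
    mla_per_token = d_latent + d_rope      # one shared latent + one RoPE key

    print(f"MHA: {mha_per_token} elements per token per layer")   # 32768
    print(f"MLA: {mla_per_token} elements per token per layer")   # 576
    print(f"reduction: {1 - mla_per_token / mha_per_token:.1%}")  # ~98% vs MHA
    # Against a GQA baseline (what most Llama-style models use) the gap is
    # smaller, which is roughly where figures like "90%" come from.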

Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.

I don't really see anything in this paper that Deepseek V3.2 hasn't already sort of done (just at a bigger scale). It's not exactly the same, but is there anything genuinely new here that's not in Deepseek V3.2?

deepdarkforest · yesterday at 11:12 PM

> which does pose interesting questions over nvidia's throne...

> Zebra-Llama is a family of hybrid large language models (LLMs) proposed by AMD that...

Hmmm

adityashankar · yesterday at 10:07 PM

Yes, thanks for the link!

moffkalast · yesterday at 11:09 PM

GGUF when? /s