charcircuit • today at 11:11 AM

Why didn't the author compare Llama 3 with GLM 5.2 (released 1 week ago), which is a more standard attention-based LLM? Comparing two separate families of LLMs and then pointing out that they are different is not a surprising result, and it detracts from the point the author is trying to make.

https://sebastianraschka.com/llm-architecture-gallery/?compa...

If you look at it, the diagrams are very similar, but the main differences are that the feedforward is replaced with a MoE (router to multiple feedforwards) and the model has a different attention implementation.
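To make "router to multiple feedforwards" concrete, here is a minimal pure-Python sketch of a top-k MoE layer. It is only an illustration under assumptions of mine: the names (`make_ffn`, `moe_layer`, `router_w`), the tiny dimensions, the ReLU feedforward, and the choice to softmax over only the selected experts are all hypothetical, not taken from GLM, Llama, or any other model. Real implementations differ in details such as gate normalization, shared experts, and load balancing.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a short list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def make_ffn(rng, d_model, d_hidden):
    # Hypothetical tiny two-layer feedforward "expert" with ReLU.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        h = [max(0.0, v) for v in matvec(w1, x)]
        return matvec(w2, h)

    return ffn


def moe_layer(x, router_w, experts, top_k=2):
    # The router produces one score per expert for this token.
    logits = matvec(router_w, x)
    # Keep only the top_k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Assumption: gate weights are a softmax over the selected experts only.
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]
    return out, top


rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 4
experts = [make_ffn(rng, d_model, d_hidden) for _ in range(n_experts)]
router_w = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(n_experts)]

x = [1.0, -0.5, 0.25, 2.0]
y, chosen = moe_layer(x, router_w, experts, top_k=2)
print("experts used:", chosen)
```

In a dense model every token goes through the same single feedforward. Here each token only runs through `top_k` of the `n_experts` feedforwards, which is where the extra routing logic in the diagram comes from.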

Replies

segmondy • today at 1:47 PM

The author is correct: model architecture is now much more complicated. You can see this if you use llama.cpp and follow the project. The earlier models were always fully implemented. Yet even with more contributors, as of today many of the latest models only have partial implementations. DeepSeekv3.2 isn't fully implemented, and the same goes for KimiK2.6 and GLM5.2+. DeepSeekv4 has no implementation, MiniMaxM3 is not supported yet, and Hy3-preview has no implementation. Support for the latest models is bare bones: enough to run them, with much of the support for advanced features missing.

embedding-shape • today at 2:41 PM

> Why didn't the author compare Llama 3 with GLM 5.2 (released 1 week ago), which is a more standard attention-based LLM? Comparing two separate families of LLMs and then pointing out that they are different is not a surprising result, and it detracts from the point the author is trying to make.

The entire point of the comparison is that LLMs look vastly different today than before. Comparing more similar LLMs would detract from the point I thought the author was trying to make.

alecco • today at 11:56 AM

Yeah, not a great apples-to-apples comparison.

I think the point stands: MoE, a myriad of complex attention approaches, shared layers, you name it. And making it all work together well is a huge trial-and-error pain even for small models, never mind getting to efficient hardware utilization.

lproven • today at 11:47 AM

> If you look at it, the diagrams are very similar,

The page links to the same site you do. No wonder they are similar -- the source is the same!

christopherwxyz • today at 11:31 AM

It’s written by AI.
