
bigyabai • yesterday at 8:20 PM • 1 reply

> From GLM 4.7 flash

GLM 4.7 Flash is a 30B model that was far behind SOTA at launch, and I know that because I pay for z.ai inference and have run the model locally. Qwen and DeepSeek V4 Flash have the same issue, and it raises the question: are you really going to process a 64k agentic context at 450 tok/s? That's 2+ minutes spent waiting for the first token to generate! Of course nobody can sell that as competitive inference, and it only gets worse with larger models. We're talking about non-interactive speeds here.
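For anyone who wants to check the "2+ minutes" figure, here is a minimal sketch of the arithmetic. It assumes the 450 tok/s number is the prompt-processing (prefill) rate and that "64k" means 64 × 1024 tokens; the function name `prefill_seconds` is made up for illustration and does not come from any library.

```python
def prefill_seconds(context_tokens: int, prefill_tok_per_s: float) -> float:
    """Estimate the wait before the first output token, assuming the whole
    prompt must be processed at a constant prefill rate."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill rate must be positive")
    return context_tokens / prefill_tok_per_s


# Assumption: "64k" is read as 64 * 1024 = 65,536 tokens.
context_tokens = 64 * 1024
wait = prefill_seconds(context_tokens, 450)
print(f"{wait:.0f} s (~{wait / 60:.1f} min)")  # prints "146 s (~2.4 min)"
```

Reading "64k" as 64,000 tokens instead gives roughly 142 seconds, so the conclusion is the same either way. Real time-to-first-token also depends on batching and prompt caching, which this sketch ignores.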

If you're satisfied with small local models, more power to you. It puts you in the same barrel as Strix Halo enthusiasts or the guys on Reddit who bought 2x3090s. You are completely ignoring the market if you think that any of those SoCs are unprecedented or unparalleled for inference workloads, though. The free DS4 API is faster at prefill and decode; you could not give away Mac inference at zero cost and compete with what China provides for free. That's how far behind Macs are for local inference, to put things into perspective.

Replies

Danox • today at 4:05 AM

You sound like IBM in the mainframe era...
