Hacker News

GLM-4.7-Flash

306 points by scrlk today at 3:12 PM | 103 comments

Comments

dajonker today at 4:25 PM

Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU, so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives me the best results. Nemotron 3 Nano is supposed to benchmark better, but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which are not covered yet". Will give this a try once someone has quantized it in ~4 bit GGUF.
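
Once a ~4-bit GGUF does land, something along these lines should fit that 32 GB budget (the filename and quant label are guesses; the flags are standard llama.cpp options):

    # hypothetical quant filename; -c sets the 128k context, -ngl offloads all
    # layers to the GPU, --jinja enables the model's chat template for tool calls
    llama-server -m GLM-4.7-Flash-Q4_K_M.gguf -c 131072 -ngl 99 --jinja --port 8080

OpenCode can then be pointed at the OpenAI-compatible endpoint llama-server exposes on that port.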

Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.

polyrand today at 5:51 PM

I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident in the results it gives me. I use it with both regular claude-code and opencode (more opencode lately, since claude-code is obviously designed to work much better with Anthropic models).

Also note that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku; even their coding plan docs mention this model is meant to be used as `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
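
For anyone wiring that up by hand, the claude-code side is just environment variables; a rough sketch (the endpoint URL and model slug are assumptions, check z.ai's coding plan docs for the exact values):

    # assumed endpoint and model id; verify against z.ai's docs
    export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
    export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"
    export ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7-flash"   # Flash fills the Haiku slot
    claude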

vessenes today at 3:46 PM

Looks like solid incremental improvements. The UI one-shot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it's a tough one to self-host. It's a good candidate for a Cerebras endpoint in my mind: getting Sonnet 4.x (x<5) quality with ultra-low latency seems appealing.

linolevan today at 9:48 PM

Tried it within LM Studio on my M4 MacBook Pro; it feels dramatically worse than gpt-oss-20b. Of the two (code) prompts I've tried so far, it started spitting out invalid code and got stuck in a repeating loop for both. It's possible that LM Studio quantizes the model in such a way that it explodes, but so far not a great first impression.

baranmelik today at 4:59 PM

For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.

montroser today at 5:18 PM

This is their blurb about the release:

    We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.

    The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.
https://docs.z.ai/release-notes/new-released
veselin today at 8:48 PM

What is the state of using quants? For chat models, a few errors or some lost intelligence may matter only a little. But what happens to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent?

I am interested in whether I can run it on a 24GB RTX 4090.

Also, would vLLM be a good option?

jcuenod today at 7:51 PM

Comparison to GPT-OSS-20B (irrespective of how you feel that model actually performs) doesn't fill me with confidence. Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5, I would have hoped that their flash model would run circles around GPT-OSS-120B. I do wish they would provide an Aider result for comparison. Aider may be saturated among SotA models, but it's not at this size.

aziis98 today at 9:31 PM

I hope we get good A1B models, as I'm currently GPU-poor and can only do inference on CPU for now.

bilsbie today at 4:37 PM

What’s the significance of this for someone out of the loop?

esafak today at 5:17 PM

When I want fast, I reach for Gemini or Cerebras: https://www.cerebras.ai/blog/glm-4-7

GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.

andhuman today at 8:21 PM

Gave it four of my vibe questions around general knowledge and it didn’t do great. Maybe expected with a model as small as this one. Once support in llama.cpp is out I will take it for a spin.

arbuge today at 6:48 PM

Perhaps somebody more familiar with HF can explain this to me... I'm not too sure what's going on here:

https://huggingface.co/inference/models?model=zai-org%2FGLM-...

infocollector today at 5:40 PM

Maybe someone here has tackled this before. I’m trying to connect Antigravity or Cursor with GLM/Qwen coding models, but haven’t had any luck so far. I can easily run Open-WebUI + LLaMA on my 5090 Ubuntu box without issues. However, when I try to point Antigravity or Cursor to those models, they don’t seem to recognize or access them. Has anyone successfully set this up?

montroser today at 5:32 PM

> SWE-bench Verified 59.2

This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.

syntaxing today at 5:56 PM

I find GLM models so good, better than Qwen IMO. I wish they'd released a new GLM Air so I could run it on my Framework desktop.

dfajgljsldkjag today at 4:20 PM

Interesting that they are releasing a tiny (30B) variant, unlike the 4.5-Air distill, which was 106B parameters. It must be competing with GPT mini and nano models, which I have personally found to be pretty weak. But this could be perfect for local LLM use cases.

In my experience, small-tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. The 70B class and above is where models really start to shine.

eurekin today at 4:31 PM

I'm trying to run it, but getting odd errors. Has anybody managed to run it locally and can share the command?

karmakaze today at 3:40 PM

Not much info beyond it being a 31B model. Here's info on GLM-4.7[0] in general.

I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.

[0] https://z.ai/blog/glm-4.7

XCSme today at 3:46 PM

Seems to be marginally better than gpt-oss-20b, but this is 30B?

pixelmelt today at 5:24 PM

I'm glad they're still releasing models despite going public.

twelvechess today at 3:54 PM

Excited to test this out. We need a SOTA 8B model badly, though!

epolanski today at 3:38 PM

Any cloud vendor offering this model? I would like to try it.

kylehotchkiss today at 6:58 PM

What's the minimum hardware you need to run this at a reasonable speed?

My Mac Mini probably isn't up to the task, but in the future I might be interested in a Mac Studio just to churn through long-running data-enrichment types of projects.
