
jtrn · yesterday at 8:44 PM

My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B params / 32B active. vLLM/SGLang support is currently only on the main branch of those engines, not in the stable releases. Supports tool calling in OpenAI-style format. Multilingual, English/Chinese primary. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca. 220GB for Q4_K_M.
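
Back-of-the-envelope for those memory figures (a sketch; the ~4.85 bits/weight for Q4_K_M is an approximation on my part, not an official number):

    # Rough memory-footprint estimate for a 358B-parameter model.
    # The Q4_K_M bits-per-weight value is approximate (K-quants mix
    # 4- and 6-bit blocks plus scales), so treat the result as ballpark.
    params = 358e9
    fp16_gb = params * 2 / 1e9            # 2 bytes per weight -> ~716 GB
    q4_km_gb = params * 4.85 / 8 / 1e9    # ~4.85 bits per weight -> ~217 GB
    print(f"FP16: {fp16_gb:.0f} GB, Q4_K_M: ~{q4_km_gb:.0f} GB")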

My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimi K2 in addition. I like that open-weight models are nipping at the heels of the proprietary models.


Replies

hasperdi · yesterday at 11:07 PM

I bought a second‑hand Mac Studio Ultra M1 with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.

For instance, a 4‑bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only the tokens-per-second speed but also input processing, tokenization, and prompt loading; it takes so much time that it tests my patience. People often mention TPS numbers, but they neglect to mention the input loading times.
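
To put a number on the prompt-loading point (illustrative assumptions only, not measurements from my machine):

    # Why TPS alone is misleading: prefill (prompt processing) and generation
    # have very different throughputs. All numbers below are assumptions.
    prompt_tokens = 30_000    # a typical coding-agent context (assumed)
    prefill_tps = 60          # prompt-processing tokens/s (assumed)
    gen_tps = 10              # generation tokens/s (assumed)
    output_tokens = 500

    time_to_first_token = prompt_tokens / prefill_tps   # ~500 s before any output
    generation_time = output_tokens / gen_tps           # ~50 s for the answer itself
    print(f"prefill: {time_to_first_token:.0f} s, generation: {generation_time:.0f} s")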

embedding-shape · yesterday at 8:47 PM

> Supports tool calling in OpenAI-style format

So Harmony? Or something older? Since Z.ai also claims the thinking mode interleaves tool calling and reasoning, it would make sense if it were straight-up OpenAI's Harmony.
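
For context, a plain "OpenAI-style" tool declaration looks like this (the weather tool is a made-up placeholder, not from the model card; whether the chat template is literally Harmony is the open question):

    # Minimal OpenAI-style tool declaration for a chat-completions request.
    # "get_weather" is a placeholder tool, purely for illustration.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]
    # The model then responds with a structured tool_calls entry
    # (function name + JSON-encoded arguments) rather than plain text.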

> in theory, I could get a "relatively" cheap Mac Studio and run this locally

In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

__natty__ · yesterday at 9:08 PM

I can imagine someone from the past reading this comment and having a moment of doubt

mft_ · yesterday at 10:37 PM

I’m never clear, for these models with only a proportion of parameters active (32B here), to what extent this reduces the RAM a system needs, if at all?
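
My rough understanding, as a sketch (happy to be corrected): RAM tracks total parameters, since every expert has to be resident, while the active subset mainly determines per-token bandwidth/compute, i.e. speed. Illustrative numbers, reusing the Q4_K_M approximation from above:

    # MoE rule of thumb (sketch): memory scales with TOTAL params, since all
    # experts must be loaded; per-token work scales with ACTIVE params.
    total_params = 358e9
    active_params = 32e9
    bytes_per_weight = 4.85 / 8          # Q4_K_M approximation (assumption)

    resident_gb = total_params * bytes_per_weight / 1e9    # ~217 GB of RAM either way
    per_token_gb = active_params * bytes_per_weight / 1e9  # ~19 GB read per token
    print(f"resident: ~{resident_gb:.0f} GB, read per token: ~{per_token_gb:.0f} GB")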

reissbaker · yesterday at 8:51 PM

s/Sonnet 3.5/Sonnet 4.5

The model's output also IMO looks significantly more beautiful than GLM-4.6's; no doubt helped in part by ample distillation data from the closed-source models. Still, not complaining: I'd much prefer a cheap, open-source model to a more expensive closed-source one.

Tepix · today at 12:13 AM

I'm going to try running it on two Strix Halo systems (256GB RAM total), networked via 2 USB4/TB3 ports.
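
On paper the fit is tight; a quick sketch (the usable-memory fraction and any networking overhead are my assumptions):

    # Does a ~220 GB Q4_K_M split across two 128 GB Strix Halo boxes? Sketch only;
    # the usable fraction (OS, KV cache, activations) is an assumption.
    nodes, ram_per_node_gb = 2, 128
    usable_fraction = 0.85
    usable_gb = nodes * ram_per_node_gb * usable_fraction   # ~218 GB
    model_gb = 220                                           # from the parent comment
    print(f"usable: ~{usable_gb:.0f} GB vs model: ~{model_gb} GB -> very tight")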

whimsicalism · yesterday at 11:20 PM

commentators here are oddly obsessed with local serving imo; it's essentially never practical. it is okay to have to rent a GPU, but open weights are definitely good and important.
