> Supports tool calling in OpenAI-style format So Harmony? Or something older? Since Z.ai also ...

embedding-shape • yesterday at 8:47 PM • 3 replies • view on HN

> Supports tool calling in OpenAI-style format

So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.

> in theory, I could get a "relatively" cheap Mac Studio and run this locally

In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

Replies

biddit • yesterday at 8:58 PM

> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.

Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.

It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.

➕ show 1 reply

reissbaker • yesterday at 8:53 PM

No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...

rz2k • yesterday at 9:33 PM

In practice the 4bit MLX version runs at 20t/s for general chat. Do you consider that too slow for practical use?

What example tasks would you try?

alt Hacker News

Replies