> Supports tool calling in OpenAI-style format
So Harmony? Or something older? Since Z.ai also claim the thinking mode does tool calling and reasoning interwoven, would make sense it was straight up OpenAI's Harmony.
> in theory, I could get a "relatively" cheap Mac Studio and run this locally
In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
In practice the 4bit MLX version runs at 20t/s for general chat. Do you consider that too slow for practical use?
What example tasks would you try?
> In practice, it'll be incredible slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.
It’s decidedly more cost efficient to use frontier model APIs. Frontier models trained to work with their tightly-coupled harnesses are worlds ahead of quantized models with generic harnesses.