I'm not sure what people are on in the comments. It doesn't beat the other models, but it sure competes despite its size.
GLM 5.1 is an excellent model, but even at Q4 you're looking at ~400GB. Kimi K2.5 is really good too, and at Q4 quantization you're looking at almost ~600GB.
This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
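Back-of-the-envelope for where that ~70GB figure comes from (a rough sketch, assuming a ~125B-parameter dense model and Q4_K-ish quants at ~4.5 bits/weight; all numbers assumed, not measured):

```python
# Rough VRAM estimate for a ~120B-class dense model at Q4 (all figures assumed, not measured)
params = 125e9            # assumed parameter count
bits_per_weight = 4.5     # Q4_K-style quants average a bit over 4 bits/weight
weights_gb = params * bits_per_weight / 8 / 1e9

kv_cache_gb = 4           # assumed KV cache for a modest context; grows with context length
overhead_gb = 2           # runtime buffers, activations

print(f"weights: ~{weights_gb:.0f} GB, total: ~{weights_gb + kv_cache_gb + overhead_gb:.0f} GB")
# -> weights: ~70 GB, total: ~76 GB
```

So the weights alone land around 70GB, with a bit more on top for cache and buffers, which is why a 128GB machine is the realistic floor.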
For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable. This beats the latest Sonnet while running locally, without anyone charging you extra for having HERMES.md in your repo, or locking you out of your account on a whim.
Mistral has never been competitive at the frontier, but maybe that is not what we need from them. Having Pareto models that get you 80% of the frontier at 20% of the cost/size sounds really good to me.
I didn't know about HERMES.md ... (??) - for others who are curious, I found information here: https://github.com/anthropics/claude-code/issues/53262
> For the Claude-pilled people, I don't know if you only run Opus but when I was on the Pro plan Sonnet was already extremely capable.
Before February I was able to use Opus on High exclusively on my Max plan, no problem. Now I've shifted to just using Sonnet on high and yeah, it's pretty capable. I love that, Claude-pilled. ;)
“This beats the latest Sonnet while running locally”
Not really.
- The benchmarks are based on FP8 (E4M3) and you’re not running that on any Mac.
- Sonnet has a 1M token context window. This is 256k, but again you’re probably not even getting that locally (rough math after this list).
- Sonnet is fast over the wire. This is going to be much slower.
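To put a number on the context point: the KV cache grows linearly with context length, and at 256k it gets enormous. A rough sketch with assumed architecture numbers (not this model's actual config):

```python
# KV-cache size at long context (architecture figures are assumptions for illustration only)
n_layers   = 100
n_kv_heads = 8        # assuming GQA
head_dim   = 128
bytes_elem = 2        # fp16 cache
ctx        = 256_000

kv_gb = 2 * n_layers * n_kv_heads * head_dim * bytes_elem * ctx / 1e9  # K and V
print(f"KV cache at {ctx:,} tokens: ~{kv_gb:.0f} GB")
# -> ~105 GB on top of the weights, which is why nobody runs the full window locally
```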
Yeah, you can run it locally if you have enough VRAM, but the reports trickling in say about 3 tok/sec. That was on a Strix Halo box, which definitely has the needed VRAM but nowhere near the memory bandwidth of a GPU card. It's going to be similar on a Mac - that's the dilemma: the unified-memory machines have the VRAM, but the bandwidth isn't great for running dense models. A dense model of this size is only going to be (usefully) runnable by the very few people with multiple GPU cards whose memory adds up to about 70GB.
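For a dense model, decode speed is roughly memory-bandwidth bound: every generated token streams the full set of weights through memory. A rough ceiling, assuming ~70GB of Q4 weights and approximate spec-sheet bandwidth numbers:

```python
# Bandwidth-bound decode ceiling: tok/s <= memory bandwidth / bytes read per token.
# Bandwidth figures are approximate spec-sheet numbers, not benchmarks.
weights_gb = 70  # ~Q4 weights for a ~120B dense model

for name, bw_gb_s in [
    ("Strix Halo (~256 GB/s)",          256),
    ("M4 Max (~546 GB/s)",              546),
    ("RTX 4090-class GPU (~1008 GB/s)", 1008),
]:
    print(f"{name}: ~{bw_gb_s / weights_gb:.1f} tok/s ceiling")
# Strix Halo: ~3.7 tok/s -- right where the early reports land
```

A single 4090 obviously can't hold 70GB, so that last line really means "per ~1 TB/s of aggregate bandwidth across cards"; actual throughput lands below these ceilings, but it explains the ~3 tok/sec Strix Halo reports.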
Let's not forget Qwen 35B A3B MoE. It gets better performance than this in all the metrics for a fraction of the memory / compute footprint.
Sad to see all the non-Chinese open-source models being at least one generation behind.
It has a similar SWE-bench score to qwen 3.6 27b [1]. No one is comparing it to the frontier.
[1]: There is no other common benchmark in the blog.
The point is it's open weight and tiny compared to a lot of its competitors. 4 GPUs for world-class performance - sweet!
The competition, at a similar size / deployment target, is DeepSeek v4 Flash.
> This model? You can run it at Q4 with 70GB of VRAM.
> This beats the latest Sonnet while running locally
Not sure it will beat Sonnet at Q4.
>This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
For $3500 I can get 7-8 years of GLM coding plans, with a faster model and much better code quality.
It’s a 128B dense model. Good luck getting more than 3 t/s out of a Mac. It doesn’t matter whether it fits or not.
Eh. Those results would be noteworthy if it were a MoE. A 120B dense? Firmly in meh territory.
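The reason dense vs. MoE matters so much for local use: at decode time a MoE only streams its active parameters per token, so on the same memory bandwidth it is dramatically faster. A rough sketch (parameter counts and ~4.5 bits/weight are assumptions):

```python
# Bytes streamed per generated token at ~4.5 bits/weight (rough, assumed figures)
def gb_per_token(active_params):
    return active_params * 4.5 / 8 / 1e9

print(f"120B dense:          ~{gb_per_token(120e9):.1f} GB/token")
print(f"MoE with 3B active:  ~{gb_per_token(3e9):.1f} GB/token")
# ~67.5 vs ~1.7 GB/token -- roughly 40x less memory traffic per token for the MoE,
# even though the MoE's total weights still have to fit in memory.
```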
I would love to be able to run frontier locally, but I think the larger importance of open weight models is price accountability.
In the US with our broken system of capitalism, it’s the only way we can tether these companies to reality. Left to their own devices, I’m not convinced they would actually compete with each other on price.
But nobody likes to talk about how “moat” building is fundamentally anti-competitive, even in name.
Funny that self proclaimed capitalists hate the system in practice. Commodity pricing is what truly terrifies them.
I was hoping for a lot from it... but this one is not up to that mark. For example, here is its comparison with a 4.7x smaller model, qwen3.6-27b.
https://chatgpt.com/share/69f239e8-7414-83a8-8fdd-6308906e5f...
TL;DR: qwen3.6-27b, a 4.7x smaller model, has similar performance.
> This model? You can run it at Q4 with 70GB of VRAM. This is approaching consumer level territory (you can get a Mac Studio with 128GB of RAM for ~3500 USD).
The one thing I would want everyone curious about local LLMs to know is that being able to run a model and being able to run a model fast are two very different thresholds. You can get these models to run on a 128GB Mac, but you first need to check whether Q4 retains enough quality (models have different sensitivities to quantization) and how fast it actually runs.
For running async work and background tasks the prompt processing and token generation speeds matter less, but a lot of Mac Studio buyers have discovered the hard way that it's not going to be as responsive as working with a model hosted in the cloud on proper hardware.
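If you want to check both thresholds before committing to hardware, here's a minimal sketch using llama-cpp-python; the model path, context size, and prompt are placeholders:

```python
# Quick tok/s check for a local GGUF quant (llama-cpp-python).
# Model path, context size, and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/model-Q4_K_M.gguf",  # hypothetical path to the Q4 quant
    n_ctx=8192,        # well under the advertised window to keep the KV cache manageable
    n_gpu_layers=-1,   # offload everything that fits
)

start = time.time()
out = llm("Write a Python function that merges two sorted lists.", max_tokens=256)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```

Running the same prompts against a higher-precision quant or a hosted endpoint is the quickest way to tell whether Q4 is actually degrading answers for your use case.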
For most people without hard requirements for on-site processing, the best use case for this model would be going through one of the OpenRouter hosted providers for it and paying by token.
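OpenRouter exposes an OpenAI-compatible endpoint, so trying a hosted copy is only a few lines (the model slug below is a placeholder - check the actual listing):

```python
# Calling a hosted copy through OpenRouter's OpenAI-compatible API.
# The model slug is a placeholder -- substitute the real listing.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="mistralai/<model-slug>",  # hypothetical slug for this release
    messages=[{"role": "user", "content": "Refactor this function to be iterative: ..."}],
)
print(resp.choices[0].message.content)
```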
> This beats the latest Sonnet while running locally
Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.