Hacker News

tarruda today at 1:50 PM

Note that this is not the only way to run Qwen 3.5 397B on consumer devices. There are excellent ~2.5 BPW quants available that make it viable for 128 GB devices.
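A rough back-of-the-envelope check of the 128 GB figure, as a minimal sketch. It assumes every one of the 397B parameters is stored at the average bits-per-weight; real GGUF quants mix precisions per tensor, and the KV cache and runtime buffers come on top. The function name is hypothetical, chosen for illustration:

```python
def quantized_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in bytes (assumes uniform BPW)."""
    return n_params * bits_per_weight / 8


# 397B parameters at ~2.5 bits per weight (both figures taken from the comment above)
size = quantized_weight_bytes(397e9, 2.5)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

Under these assumptions the weights alone come to about 124 GB in decimal units, or roughly 115.5 GiB.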

I've had great success (~20 t/s) running it on an M1 Ultra with room for 256k context. Here are some lm-evaluation-harness results I ran against it:

    mmlu: 87.86%
    gpqa diamond: 82.32%
    gsm8k: 86.43%
    ifeval: 75.90%

More details of my experience:

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discu...

- https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

Overall an excellent model to have for offline inference.


Replies

Aurornis today at 2:22 PM

The method in this link is already using a 2-bit quant. They also reduced the number of experts per token from 10 to 4, which is another layer of quality degradation.

In my experience, 2-bit quants can produce sensible output for short prompts, but they aren’t useful for doing real work in longer sessions.

This project couldn’t even get useful JSON out of the model because it can’t produce the right token for quotes:

> 2-bit quantization produces \name\ instead of "name" in JSON output, making tool calling unreliable.
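To illustrate why that failure mode breaks tool calling, here is a minimal sketch of how a caller might reject such output. The helper `parse_tool_call` and the tool name `get_weather` are hypothetical, and the malformed string is only modelled on the quoted description, not taken from the project's logs:

```python
import json


def parse_tool_call(raw: str):
    """Return the parsed tool call, or None if the model output is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


good = '{"name": "get_weather"}'    # well-formed output (hypothetical tool name)
bad = r'{\name\: \get_weather\}'    # backslashes where the double quotes should be

print(parse_tool_call(good))
print(parse_tool_call(bad))
```

The second string fails strict JSON parsing, so a tool-calling loop that validates its input has nothing usable to dispatch.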

woile today at 5:07 PM

Just a single M1 Ultra?

arjie today at 4:05 PM

What's the tok/s you get these days? Does it actually work well when you use more of that context?

By the way, it's been a long time since I last saw your username. You're the guy who launched Neovim! Boy, what a success. Definitely the Kickstarter/Bountysource campaign I've been a tiny part of that had the best outcome. I use it every day.

iwontberude today at 5:03 PM

Thank you, I have been using way too many credits for my personal automation.

outlog today at 3:45 PM

What is the power usage? Maybe https://www.coconut-flavour.com/coconutbattery/ can give you an estimate?
