The problem with this model is that DeepSeek v4 Flash runs quite well quantized to 2 bits (see https://github.com/antirez/llama.cpp-deepseek-v4-flash): 30 t/s generation and 400 t/s prefill on an M3 Ultra, and not much slower on a 128GB MacBook Pro M3 Max. It works well as a coding agent with opencode/pi, tool calling is very reliable, and so forth, all at a speed that a 120B dense model can never achieve. So this model has to compete not just with models that fit in the same memory when 4-bit quantized, but with an 86GB GGUF file of DeepSeek v4 Flash, and that is not an easy fight to win in practical terms for local inference.
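For intuition on why a dense model can't match that generation speed, here is a minimal back-of-envelope sketch. It assumes decode is memory-bandwidth bound; the MoE active-parameter count, the bits-per-weight figures, and the ~800 GB/s M3 Ultra bandwidth are rough assumptions for illustration, not measurements:

```python
# Rough decode-speed model: on Apple Silicon, token generation is mostly
# memory-bandwidth bound, so tokens/s <= bandwidth / weight_bytes_per_token.
# All figures below are illustrative assumptions, not benchmarks.

GB = 1e9

def max_decode_tps(active_params_billions, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s when streaming the weights dominates decode."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * GB / bytes_per_token

M3_ULTRA_BW = 800  # GB/s unified memory bandwidth (approximate)

# The MoE is big on disk (86GB GGUF at ~2.5 bpw effective) but only a
# fraction of its parameters are active per token; a dense 120B model
# must touch every weight on every token.
print(f"MoE, ~20B active @ 2.5 bpw: {max_decode_tps(20, 2.5, M3_ULTRA_BW):6.1f} t/s ceiling")
print(f"Dense, 120B @ 4.5 bpw:     {max_decode_tps(120, 4.5, M3_ULTRA_BW):6.1f} t/s ceiling")
```

Real numbers land below these ceilings (attention, KV cache, compute overhead), but the ratio is the point: the dense model pays the full weight-streaming cost on every single token.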
Note: I have more uncommitted speed improvements in my tree that I'll push soon. The current tree could be a little slower, but not by much: still super usable.
I don't understand one thing about Mistral, of which I'm a fan, being in Europe: they opened the open-weights MoE show with Mixtral. Why are they now releasing dense models of significant size? This way you don't compete in any credible space: not local inference, and not remote inference either, since the model is far from SOTA and not cheap to serve. So why are they training such big dense models? Dense models have a place in the few tens of billions of parameters, as Qwen 3.6 27B shows, but at five times that size they are no longer a fit, unless your capabilities crush anything else requiring the same VRAM, which is not the case here.
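To make the memory-class argument concrete, a quick footprint sketch; the ~4.5 bits/weight figure (roughly Q4_K_M territory) and the 135B stand-in for "5× a 27B model" are my assumptions, not Mistral's actual specs:

```python
# Approximate weight footprint (GGUF/VRAM, weights only, no KV cache)
# at ~4.5 bits/weight, in the ballpark of Q4_K_M quantization.

def weight_gb(params_billions, bits_per_weight=4.5):
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"27B dense:  ~{weight_gb(27):.0f} GB (fits a 24GB GPU with room for KV cache)")
print(f"135B dense: ~{weight_gb(135):.0f} GB (needs a 96GB+ Mac or a multi-GPU rig)")
```

At roughly 76GB of weights alone you are in the same memory class as the 86GB MoE GGUF above, but without the MoE's per-token speed advantage.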
Your GitHub link only says "The model quantized in this way behaves very very well in the chat, frontier-model vibes, but it was not extensively tested." That says little about how it behaves in agentic workflows, and we're aware of how often models degrade severely with Q2 quantization. If this quantized Flash can maintain reasonable quality and performance at larger context lengths (which seems to be a key feature of the V4 series), it could be a very reasonable competitor to models in the same weight class, like Qwen 3 Coder-Next 80B.