arcanemachiner • today at 3:43 PM

Divide the number before the B by 2, and that's roughly the VRAM in GB you need for a Q4_K_M quant, plus a bit of room for the KV cache.

TLDR: If you have 14GB of VRAM, you can try out this model with a 4-bit quant.
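If it helps, the rule of thumb above can be written as a quick back-of-the-envelope calculation. This is only a sketch: the function name, the default of exactly 4 bits per weight, and the 1 GB KV cache headroom are my own assumptions for illustration, not measured values, and real quant files can come out somewhat larger.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.0,
                     kv_headroom_gb: float = 1.0) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    params_billion:  the number before the "B" in the model name.
    bits_per_weight: assumed to be 4 for a 4-bit quant (a simplification).
    kv_headroom_gb:  assumed extra room for the KV cache; it grows with
                     context length, so treat this as a placeholder.
    """
    # 1 billion weights at 8 bits each is about 1 GB, so 4 bits halves that.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_headroom_gb


if __name__ == "__main__":
    # Hypothetical sizes, just to show the "divide by 2" pattern.
    for size in (8, 14, 32):
        print(f"{size}B -> about {estimate_vram_gb(size):.1f} GB")
```

Passing `bits_per_weight=8.0` gives the same estimate for an 8-bit quant, which is just the number before the B plus headroom.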

Tokens per second is an unreasonable ask, since it depends on too many things: every card is different, and it matters whether you are using GGUF or not, whether you are on CUDA, ROCm, Vulkan, or MLX, what optimizations are in your version of your inference software, what flags you are running, etc.

Note that it's a dense model (the Qwen MoE models have another value at the end of their names, e.g. A3B), so it will not run very well from system RAM. With an MoE model, you can spill over into RAM if you don't have enough VRAM and still get reasonable performance.
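As a rough illustration of the naming convention, here is a small sketch that pulls the two numbers out of a model name. The function name and the regex are hypothetical, the names in the usage lines are only examples, and it assumes the name follows the "<total>B" or "<total>B-A<active>B" pattern, with the second number being the active parameter count.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "30B-A3B" (MoE) or "32B" (dense) inside a model name.
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)B(?:-A(\d+(?:\.\d+)?)B)?")


def parse_model_size(name: str) -> Tuple[float, Optional[float]]:
    """Return (total_params_B, active_params_B or None for a dense model)."""
    match = _SIZE_RE.search(name)
    if match is None:
        raise ValueError(f"no parameter count found in {name!r}")
    total = float(match.group(1))
    # The second group is only present when the name has an "A<n>B" suffix.
    active = float(match.group(2)) if match.group(2) else None
    return total, active


if __name__ == "__main__":
    for name in ("Qwen3-30B-A3B", "Qwen3-32B"):
        total, active = parse_model_size(name)
        kind = "MoE" if active is not None else "dense"
        print(name, "->", kind, total, active)
```

A name without the A suffix comes back with `None` for the active count, which is the dense case described above.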

Using these models requires some technical know-how, and there's no getting around that.
