
cyanydeez • today at 4:58 PM • 2 replies

Yeah, then there's prompt loading too.

But anyone who can fit Qwen-3.6 35B with a sustained ~30 tok/s and ~100k context with cache could print money as a hardware vendor.

Replies

upboundspiral • today at 7:21 PM

With llama.cpp and offloading the non-active experts (from the MoE architecture) to CPU RAM, you can easily run Qwen-3.6 35B at 50 tok/s on 8-12 GB of VRAM. The KV cache is a few GB, and the experts are ~3-5 GB (assuming a q8 quant from Unsloth, for example).
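To sanity-check the "KV cache is a few GB" figure, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Qwen config, and the formula assumes plain attention with a full K and V entry per token per layer. Models with hybrid or sliding-window attention will come in lower.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size for a standard (grouped-query) attention stack.

    Each layer stores one K and one V vector of n_kv_heads * head_dim
    elements per token, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration (assumption).
shape = dict(n_layers=48, n_kv_heads=4, head_dim=128, context_len=100_000)

fp16_cache = kv_cache_bytes(**shape, bytes_per_elem=2)  # 16-bit cache
q8_cache = kv_cache_bytes(**shape, bytes_per_elem=1)    # roughly an 8-bit cache

print(f"fp16 KV cache: {fp16_cache / 2**30:.1f} GiB")
print(f"8-bit KV cache: {q8_cache / 2**30:.1f} GiB")
```

Plug in the real values from the model's config to get an actual number. The point is just that the cache scales linearly with context length and with bytes per element, so a quantized cache roughly halves it.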

You can scroll through r/LocalLLaMA and find tons of people getting usable speeds out of Qwen 35B.

24 tok/s on an ancient 1080 Ti

https://old.reddit.com/r/LocalLLaMA/comments/1tcc7h5/24_toks...

100 tok/s on a 4070

https://old.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_tok...

wmf • today at 5:13 PM

That just sounds like a 3090.

