Qwen3.5-27B with a 4bit quant can be run on a 24G card with no problem. With 2 Nvidia L4 cards and s...

proxysna • today at 3:40 PM • 2 replies

Qwen3.5-27B with a 4-bit quant runs on a 24 GB card with no problem. With 2 Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/s; off-peak it is around 40 tok/s. The developers are OK with that performance, but of course they have requested more GPUs for added throughput.
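
For anyone checking the "fits on a 24 GB card" claim, here is a rough back-of-the-envelope sketch. It assumes exactly 27e9 parameters at exactly 4 bits each and counts weights only. Real 4-bit formats also store scales and zero points, so the effective size is somewhat higher, and the KV cache, activations, and runtime overhead come out of the remaining headroom. The function name is hypothetical, and GB here means 10^9 bytes.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 27e9 parameters, idealized 4 bits per weight (no quantization metadata).
weights_gb = weight_memory_gb(27e9, 4)

# Assumption: the card exposes the full 24 GB; in practice slightly less is usable.
headroom_gb = 24 - weights_gb

print(f"weights: ~{weights_gb:.1f} GB, headroom on a 24 GB card: ~{headroom_gb:.1f} GB")
```

Under these assumptions the weights take roughly half the card, which is consistent with the comment above; how many concurrent requests the leftover memory supports depends on context length and the serving configuration.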

Replies

tandr • today at 4:18 PM

What are those additional vLLM flags, if you don't mind sharing?

PcChip • today at 5:13 PM

Question: why not use something like Claude? Is it for security reasons?
