
proxysna today at 3:40 PM

Qwen3.5-27B with a 4-bit quant runs on a 24 GB card with no problem. With 2 Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/s; off-peak is around 40 tok/s. Developers are OK with that performance, but of course they requested more GPUs for added throughput.
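For a rough sense of why a 27B model at 4 bits fits on a 24 GB card, here is a back-of-the-envelope sketch. It counts weights only; the helper name `weight_memory_gb` is made up for illustration, and real usage will be higher because quantization scales, any unquantized layers, the KV cache, and activations all take memory on top of this.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal GB.

    Hypothetical helper: ignores quantization metadata, KV cache,
    activations, and framework overhead.
    """
    return num_params * bits_per_weight / 8 / 1e9


# 27B parameters at 4 bits per weight
weights_gb = weight_memory_gb(27e9, 4)

# What remains on a single 24 GB card for everything else
headroom_gb = 24 - weights_gb

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

Under these assumptions the weights come to about 13.5 GB, leaving roughly 10.5 GB per card for cache and overhead.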


Replies

tandr today at 4:18 PM

What are those additional vLLM flags, if you don't mind sharing?

PcChip today at 5:13 PM

Question: why not use something like Claude? Is it for security reasons?
