VHRanger • yesterday at 6:29 PM

On vLLM with a 5090, I get 120-180 TPS with the AWQ 4-bit quant + MTP speculative decoding.

For gemma4 26B, same quantization, I get >200 TPS.

Also note that qwen is extremely inefficient in reasoning; its reasoning chains are ~3x longer than gemma's on average.
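
To see what the chain-length difference does to wall-clock time, here is a rough back-of-the-envelope sketch. The TPS figures and the ~3x ratio come from the numbers above; the token counts (`GEMMA_REASONING`, `ANSWER`) are hypothetical values picked only for illustration, and the helper `seconds_to_answer` assumes a steady decode rate with no prefill or queueing time.

```python
def seconds_to_answer(reasoning_tokens: int, answer_tokens: int, tps: float) -> float:
    """Decode time for one response at a steady tokens-per-second rate."""
    return (reasoning_tokens + answer_tokens) / tps


# Hypothetical token counts, not measurements.
GEMMA_REASONING = 1_000
QWEN_REASONING = 3 * GEMMA_REASONING  # the "~3x longer" reasoning chains
ANSWER = 300

# 180 TPS is the top of the 120-180 range; 200 TPS is the floor of ">200".
qwen_best = seconds_to_answer(QWEN_REASONING, ANSWER, 180)
gemma = seconds_to_answer(GEMMA_REASONING, ANSWER, 200)

print(f"qwen  @ 180 TPS: {qwen_best:.1f} s")
print(f"gemma @ 200 TPS: {gemma:.1f} s")
```

Under these assumed token counts, even the best-case qwen throughput takes more than twice as long per answer, so raw TPS understates the gap.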
