Gracana • yesterday at 8:21 PM

I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with a small context (16k). I need to do more tuning and I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.

Replies

segmondy • yesterday at 9:41 PM

You can add one more GPU so you can take advantage of tensor parallelism. I get the same speed with five 3090s and most of the model in 2400 MHz DDR4 RAM: 8.5 tok/s, almost constant. I mostly do chat rather than agents, and it holds up to 64k.

esafak • yesterday at 9:01 PM

The pitiful state of GPUs. $10K for a sloth with no memory.
