Hacker News

Gracana yesterday at 8:21 PM

I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with a small context (16k). I need to do more tuning and I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.


Replies

segmondy yesterday at 9:41 PM

You can add 1 more GPU so you can take advantage of tensor parallelism. I get the same speed with five 3090s and most of the model in 2400 MHz DDR4 RAM: 8.5 tok/s, almost constant. I don't really do agents, just chat, and it holds up to 64k context.
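A rough sketch of why an eighth GPU would help here, assuming the inference engine requires the attention head count to divide evenly across GPUs (a common constraint for tensor parallelism, but check your engine). The head count below is a hypothetical value for illustration, not taken from the model discussed in this thread, and `valid_tp_sizes` is a made-up helper name.

```python
def valid_tp_sizes(num_heads: int, num_gpus: int) -> list[int]:
    """Return tensor-parallel sizes <= num_gpus that evenly divide num_heads."""
    return [n for n in range(1, num_gpus + 1) if num_heads % n == 0]


# Hypothetical attention head count, chosen only for illustration.
NUM_HEADS = 64

for gpus in (7, 8):
    sizes = valid_tp_sizes(NUM_HEADS, gpus)
    print(f"{gpus} GPUs: largest usable tensor-parallel size = {max(sizes)}")
```

Under that assumption, 7 GPUs can only shard across 4 of them, while 8 GPUs can all participate.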

esafak yesterday at 9:01 PM

The pitiful state of GPUs. $10K for a sloth with no memory.