skolos • yesterday at 4:04 PM

I'd say adding another 16 GB GPU would be worth it: you'd be able to run a larger model or a longer context entirely on the GPUs, which gives you more options for what you can run fast. Your current model probably doesn't run completely from the GPU (depending on the quant, I don't think you can squeeze Gemma4:26b into 16 GB of VRAM), so you already have some layers running on the GPU and some on the CPU. If you add another GPU, you might be able to move all layers into VRAM, which should speed things up for you. Each layer is computed on whatever device it sits on, so the layers that are already on your RTX 5080 would run the same as before, but the layers your CPU currently handles would be computed with the faster VRAM and compute of the RTX 5060.
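
To make the layer-offload reasoning concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `layers_on_gpu`, the 20 GB model size, the 40-layer count, and the per-GPU reserve for context/KV cache and display output are made-up round numbers, not measured figures for Gemma4:26b or any real quant. It also pretends all layers are the same size, which real models only approximate.

```python
def layers_on_gpu(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is a hypothetical allowance for KV cache, runtime buffers
    and whatever the desktop/monitor is using.
    """
    if model_gb <= 0 or n_layers <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers          # naive: every layer the same size
    usable_gb = max(vram_gb - reserve_gb, 0.0)  # VRAM left for weights
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 20 GB model with 40 layers (illustrative numbers only).
one_gpu = layers_on_gpu(20.0, 40, vram_gb=16.0)                   # single 16 GB card
two_gpus = layers_on_gpu(20.0, 40, vram_gb=32.0, reserve_gb=3.0)  # 2 x 16 GB, reserve per card

print(f"one GPU:  {one_gpu}/40 layers in VRAM")   # 29/40, the rest falls back to CPU
print(f"two GPUs: {two_gpus}/40 layers in VRAM")  # 40/40, nothing left on CPU
```

With these made-up numbers, a single 16 GB card leaves about a quarter of the layers on the CPU, while two cards hold everything. The real split depends on the quant, the context length and how the runtime distributes layers, so treat this as a sanity check rather than a prediction.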

Replies

edb_123 • today at 1:18 AM

Thanks! I'm seeing a 10/90 split between CPU/GPU with gemma4:26b, so I guess there's at least something to gain by adding the other GPU. And, from what I gather, perhaps something to gain by connecting the monitor to the iGPU instead to free up VRAM.

In case anyone is interested in how a consumer PC setup like this performs, still using only 1x RTX 5080 + 64GB system RAM and an Intel Ultra 270K-Plus: I tested Qwen3.6:35b-a3b just now (using ollama with default settings) and I'm getting around 86 t/s. The lowest I've seen so far is 70 t/s. The CPU/GPU split with 35b is 39/61% (with a 4K 165 fps monitor connected to the 5080, so there's probably some room for optimization here by moving it to the iGPU).

Best thing is that this setup is basically dead silent (it could, hypothetically speaking, be running in my bedroom just fine, and I'm a light sleeper).