Hacker News

msp26 · yesterday at 6:06 PM

Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic.

However, it is a little painful to fit the best possible version into 24 GB of VRAM once you add vision support and, soon, this drafter. My build doesn't support any more GPUs, so I'd either need another 4090 (overpriced) for best performance or have to replace the card altogether.
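A rough back-of-the-envelope shows why a ~31B model is tight in 24 GB. The quantization level and overheads below are illustrative assumptions, not the commenter's actual setup:

```python
# Rough VRAM estimate for a 31B-parameter model (assumed Q4_K_M-style quant).
PARAMS = 31e9
BITS_PER_WEIGHT = 4.5   # approximate effective bits/weight for a 4-bit quant (assumption)

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"weights: {weights_gb:.1f} GB")

# Whatever is left on a 24 GB card must cover the KV cache, the vision
# projector, a draft model for speculative decoding, and runtime overhead.
remaining_gb = 24 - weights_gb
print(f"left for KV cache / mmproj / drafter: {remaining_gb:.1f} GB")
```

With roughly 17 GB going to weights alone, only about 6–7 GB remains for everything else, which is why vision plus a drafter makes the fit uncomfortable.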


Replies

srigi · yesterday at 8:50 PM

You could keep the multimodal projector (which handles audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp. It won't be GPU-accelerated then, of course, but you save its VRAM.
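As a sketch, that might look like the following `llama-server` invocation; the model and projector file names and the context size are placeholders, not from the thread:

```shell
# Keep the multimodal projector in system RAM instead of VRAM.
# File names and -c value are hypothetical examples.
llama-server \
  -m gemma-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma.gguf \
  --no-mmproj-offload \
  -ngl 99 \
  -c 8192
```

The projector then runs on the CPU while all transformer layers (`-ngl 99`) stay on the GPU, trading slower image encoding for extra VRAM headroom.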

ActorNightly · yesterday at 6:20 PM

Qwen is still better than Gemma, though. You can also tune it more for different tasks, which means you can prioritize thinking and accuracy versus inference speed.
