Hacker News

msp26 · yesterday at 6:06 PM

Google is singlehandedly carrying western open source models. Gemma 4 31B is fantastic.

However, it is a little painful to fit the best possible version into 24 GB of VRAM once you add vision support and, soon, this drafter. My build doesn't support any more GPUs, so I'd either need another 4090 (overpriced) for best performance or have to replace the card altogether.
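A rough back-of-the-envelope shows why a ~31B model is tight in 24 GB. The quantization level and overheads below are illustrative assumptions, not the commenter's actual setup:

```python
# Rough VRAM estimate for a 31B-parameter model (assumed Q4_K_M-style quant).
PARAMS = 31e9
BITS_PER_WEIGHT = 4.5   # approximate effective bits/weight for a 4-bit quant (assumption)

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"weights: {weights_gb:.1f} GB")

# Whatever is left on a 24 GB card must cover the KV cache, the vision
# projector, a draft model for speculative decoding, and runtime overhead.
remaining_gb = 24 - weights_gb
print(f"left for KV cache / mmproj / drafter: {remaining_gb:.1f} GB")
```

With roughly 17 GB going to weights alone, only about 6–7 GB remains for everything else, which is why vision plus a drafter makes the fit uncomfortable.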


Replies

srigi · yesterday at 8:50 PM

You could keep the multimodal projector (which handles audio, images & PDFs) in system RAM with `--no-mmproj-offload` in llama.cpp. It won't be GPU-accelerated then, of course, but you save its VRAM.
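As a sketch, that might look like the following `llama-server` invocation; the model and projector file names and the context size are placeholders, not from the thread:

```shell
# Keep the multimodal projector in system RAM instead of VRAM.
# File names and -c value are hypothetical examples.
llama-server \
  -m gemma-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma.gguf \
  --no-mmproj-offload \
  -ngl 99 \
  -c 8192
```

The projector then runs on the CPU while all transformer layers (`-ngl 99`) stay on the GPU, trading slower image encoding for extra VRAM headroom.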

ActorNightly · yesterday at 6:20 PM

Qwen is still better than Gemma, though. You can also tune it more for different tasks, which means you can prioritize thinking and accuracy versus inference speed.
