Hacker News

qingcharles · yesterday at 6:40 PM

Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?


Replies

lambda · yesterday at 7:55 PM

I'm running an 8-bit quant right now, mostly for speed: memory bandwidth is the limiting factor, and 8-bit quants generally lose very little quality compared to full resolution. It also saves RAM.

I'm still working on tweaking the settings; I'm hitting OOM fairly often right now. It turns out that the sliding-window attention context is huge, and llama.cpp wants to keep lots of context snapshots.
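To get a feel for why the context alone can trigger OOM, here is a rough back-of-envelope sketch of KV-cache size. Every dimension below is an illustrative placeholder, not Gemma's actual configuration:

```python
# Rough KV-cache size estimate: each transformer layer stores one key
# and one value vector per KV head per cached token. All hyperparameter
# values below are hypothetical placeholders, not Gemma's real config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Bytes needed to cache keys and values for ctx_len tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# A hypothetical 60-layer model with 16 KV heads of dim 128,
# fp16 cache (2 bytes/element), at a 128k-token context:
gib = kv_cache_bytes(60, 16, 128, 131_072, 2) / 2**30
print(f"{gib:.1f} GiB")  # → 60.0 GiB, before the weights are even counted
```

This is why capping the context (llama.cpp's `-c`/`--ctx-size` flag) or quantizing the KV cache itself is the usual way to dodge OOM on a unified-memory box.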
