> I would be interested to know what speeds you can get from Gemma 4 26B and 31B on this machine, and also how ROCm compares to Triton.
I'm currently running Gemma 4 26B A4B at 8-bit quantization with reasoning off. The most recent job looked like this, which seems about average, though these are short-running tasks (<2 seconds per prompt):
```
prompt eval time =  315.66 ms /  221 tokens ( 1.43 ms per token, 700.13 tokens per second)
       eval time = 1431.96 ms /   58 tokens (24.69 ms per token,  40.50 tokens per second)
      total time = 1747.62 ms /  279 tokens
```
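The throughput figures above are just token counts divided by elapsed time; a quick sanity check (the variable names are mine, not from llama.cpp):

```python
# Recompute tokens/sec from the reported times: count / seconds.
prompt_tokens, prompt_ms = 221, 315.66
gen_tokens, gen_ms = 58, 1431.96

prompt_tps = prompt_tokens / (prompt_ms / 1000)  # prompt eval throughput
gen_tps = gen_tokens / (gen_ms / 1000)           # generation throughput

print(f"prompt: {prompt_tps:.2f} tok/s, gen: {gen_tps:.2f} tok/s")
# matches the reported ~700 and ~40.5 tok/s
```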
With reasoning enabled, it drops to roughly a quarter or a fifth of that speed, which is still reasonably comfortable for interactive use. The dense model is slower still. For some reason, Gemma 4 is notably slow on the Strix Halo with reasoning enabled compared to other models of similar size. It reasons really hard, I guess. I don't understand what makes models of similar size faster or slower; it surprised me.
Qwen 3.5 and 3.6 in similarly sized MoE versions at 8-bit quantization are notably faster on this hardware. If I were using Gemma 4 31B with reasoning interactively, I'd drop to a 6-bit or even 5-bit quantization to get it somewhere comfortable to use: it is dog slow at 8-bit, but shockingly smart and effective for such a small model.
Edit: Here are some benchmarks that match my own experience: https://kyuz0.github.io/amd-strix-halo-toolboxes/
## Performance data for token generation using LM Studio
- gemma4-31b normal q8 -> 5.1 tok/s
- gemma4-31b normal q16 -> 3.7 tok/s
- gemma4-31b distil q16 -> 3.6 tok/s
- gemma4-31b distil q8 -> 5.7 tok/s (!)
- gemma4-26b-a4b ud q8kxl -> 38 tok/s (!)
- gemma4-26b-a4b ud q16 -> 12 tok/s
- gemma4-26b-a4b cl q8 -> 42 tok/s (!)
- gemma4-26b-a4b cl q16 -> 12 tok/s
- qwen3.5-35b-a3b-UD@q6_k -> 52 tok/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@q8_0 -> 34 tok/s (!)
- qwen3.5-35b-a3b-uncensored-hauhaucs-aggressive@bf16 -> 11 tok/s
- qwen3.5-27b-claude-4.6-opus-reasoning-distilled-v2 q8 -> 8 tok/s
- qwen3.5-122b-a10b MXFP4 MoE (q4) -> 11 tok/s
- qwen3.5-122b-a10b-uncensored-hauhaucs-aggressive (q6) -> 10 tok/s
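The rough pattern in these numbers (MoE much faster than dense, lower quants faster than higher) is what you'd expect if token generation is memory-bandwidth-bound: each token requires reading every active weight once, so speed scales with bandwidth divided by bytes per token. A back-of-envelope sketch, where the bandwidth figure and the "active params = all that matters" simplification are my assumptions for illustration, not measured values:

```python
# Rough ceiling estimate for bandwidth-bound token generation:
# tok/s ~= memory bandwidth / bytes read per token.
# For a MoE model only the active parameters are read each token.
def est_tok_per_s(active_params_b: float, bits_per_weight: float,
                  bandwidth_gb_s: float = 256.0) -> float:
    """Estimate tokens/sec; bandwidth_gb_s is an assumed figure."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# A 4B-active MoE at 8-bit vs a 31B dense model at 8-bit:
print(est_tok_per_s(4, 8))   # ceiling in the tens of tok/s
print(est_tok_per_s(31, 8))  # ceiling in the single digits
```

Real throughput lands below these ceilings (activations, KV cache, and compute all cost something), but the ratio between the MoE and dense entries in the list is about what this model predicts.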