logoalt Hacker News

adam_arthurtoday at 4:18 PM5 repliesview on HN

Gemma 4 is particularly good at pipeline/automation tasks.

It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.

Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)

But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.

I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.

I agree that for coding/creation use cases, there's still not a compelling argument for local models.

But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.


Replies

dstryrtoday at 5:07 PM

This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I am finding the same when used with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...

show 1 reply
trouve_searchtoday at 4:59 PM

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

show 1 reply
ozimtoday at 7:03 PM

I was expecting DGX Spark to run Gemma 31b Q4 much faster.

I was expecting it would run Q8 in 50 tok/s.

I guess that’s good I stopped thinking about buying it because I would be disappointed.

gopher_spacetoday at 5:11 PM

In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.

msp26today at 5:51 PM

Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.