Always a bit disappointed in the details in these kinds of threads. When you do get answers, they&#x...

ryandrake • today at 4:58 PM • 2 replies • view on HN

Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?

Replies

riazrizvi • today at 5:52 PM

All you get here is some market signal from 1 or 2 posts if you already know how to do it. Most of these responses are garbage.

porkloin • today at 6:09 PM

I have good results with this setup:

Hardware:

- GPU: AMD 7900xtx, 24gb vram

- CPU: AMD 5950x, AM4

- RAM: 64gb DDR4 3600

Software:

- OS: Bazzite (atomic fedora - this machine is running Steam "big picture" mode on my TV when not in use for LLM tasks)

- Virtualization: Podman Quadlets, which allows me to run container images as managed systemd units

- Network: tailscale

- Inference: llama.cpp vulkan (better performance than ROCM, though I'm keeping an eye on it in the future)

- LLM API surface: llama-swap (running as a podman quadlet exposed via tailscale svc) allows running multiple models on a single endpoint.

- Web/Chat Access: open-webui (running as podman quadlet exposed via tailscale svc) allows me to access any of the models I'm using for coding harness access for chat/general purpose queries via web browser. I also have the "conduit" app for my iPhone that allows me to hit the same models from my phone.

Models:

- Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf - Unsloth Q4 quant of the qwen 3.6 27B model weights, with MTP enabled. MTP is important as it improves the speed the model can run at.

- Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf - Unsloth Q4 quant of 35B-A3B. Not MTP right now because I was having some issues with it?

- gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf - Gemma 4, which I use sometimes via open-webui instead of Qwen, but I generally think Qwen does a better job

Flags (specific for Qwen 27b, since that's primary model):

- `-ngl 99` offload all layers to GPU

- `-c 80000` 80K context window. I'd like this to be higher, but since my GPU also has to run the desktop session for the machine, I need to leave some VRAM overhead to keep the desktop from OOM-ing

- `-np 1` single slot (no parallel request handling)

- `--no-context-shift` error instead of silently sliding the context window when full

- `--cache-reuse 256` reuse cached prefix in chunks of 256 tokens (prompt cache)

- `-b 2048` logical batch size (tokens per submission)

- `-ub 1024` physical micro-batch (per GPU pass)

- `--cache-type-k q8_0 --cache-type-v q8_0` symmetric 8-bit K/V cache. Q8 is as low as I've been able to go without getting some issues with tool calling

- `-fa on` flash attention

- `--spec-type draft-mtp` use the model's built-in MTP as the draft model

- `--spec-draft-n-max 3` propose up to 3 draft tokens per step

- `--spec-draft-n-min 0` allow zero drafts if confidence is low

- `--spec-draft-type-k q8_0 --spec-draft-type-v q8_0` KV quant for the draft path

- `--reasoning-format deepseek` parse <think> blocks in proper format

- `--chat-template-kwargs '{"enable_thinking": true}'` turns on Qwen's thinking mode on by default (clients can override)

- `--jinja` use the GGUF's Jinja chat template

- `--temp 0.6` moderate randomness (Qwen recommended value for coding)

- `--top-p 0.95` nucleus sampling (Qwen recommended value for coding)

- `--top-k 20` top-20 candidates (Qwen recommended value for coding)

- `--min-p 0.0 disabled (Qwen recommended value for coding)

Performance (27b, primary model):

- ~65t/s for token generation

- ~600 t/s for prompt processing.

- If these numbers don't mean much to you, perceptually this feels about on-par with cloud model speed, maybe slightly faster.

- ~30s cold start when swapping from a different model or starting up session from idle via llama-swap.

I have llama-swap set up to unload the model after 10 min of idle, because I sometimes use this machine for gaming as well. A little annoying, but a small price to pay to be able to use the machine for other stuff (gaming) when I'm not using it with coding tasks.

CLI/Harness:

- Crush harness (https://github.com/charmbracelet/crush) less feature rich than Claude Code, but with a smaller system prompt and better built-in LSP support. I point it at the tailnet DNS (https://llama.<tailnet>:<port>)

- Headroom (https://github.com/chopratejas/headroom) to maximize the 80k context window

- Exa MCP for web search (https://exa.ai/) this alone makes the model far more useable. It's shocking how often the official claude code or codex harness get botblocked on web fetches, and the results of a good web fetch can be the difference between a good turn and a bad turn.

A lot of people get hung up on whether Qwen 3.x models are "as smart as" some parallel Anthropic model. Most people seem to agree it's somewhere between Haiku 4.5 and Sonnet 4.5. Personally, I think the biggest thing that makes the Qwen 3.x series of models _feel_ good to use for coding workflows is that its the first time that tool calling actually works consistently on local models. If tool calling is busted even 5% of the time, it can totally ruin the flow. I think that's also why people tend to say the "harness is more important than the model" or whatever. I have a few other models set up but 27B with MTP is the best compromise of speed and quality that I've found.

This setup works well enough for me that I dropped my personal Claude Code subscription. At work I'm still using frontier models, but personally I don't feel like I need that much power for anything I work on in my personal life. I'm "lucky" that I made the random financially unwise choice to buy a 7900XTX in late 2022 for $1k as a gaming card. I had no clue it would actually be a pretty decent LLM card 3-4 years later.

Edit: sorry for the horrible formatting, I always forget that HN doesn't actually do markdown :(

➕ show 1 reply

alt Hacker News

Replies