RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

76 points • by iMil • today at 9:55 AM • 25 comments • view on HN

Comments

triwats • today at 4:46 PM

Potential specs:

NVIDIA GeForce RTX 5080: https://flopper.io/gpu/nvidia-geforce-rtx-5080-16gb

NVIDIA GeForce RTX 3090: https://flopper.io/gpu/nvidia-geforce-rtx-3090-24gb

sieste • today at 3:56 PM

That's almost exactly my setup and I'm very happy with its performance.

I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.

Both fail at different tasks, and Qwen more so than Claude.

But the way Qwen fails is much more straightforward. In writing tasks Qwens hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.

In coding tasks that Qwen can't solve it often just goes into a tool calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things just making more and more mess that takes forever to clean up.

I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers had similar experience?

➕ show 2 replies

ydj • today at 4:26 PM

80tp/s with 5080 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for qwen3.6 27b q8. Guess I got more optimization to do.

Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).

Being in California electricity alone puts this non-competitive with just paying a cloud though.

➕ show 2 replies

deng • today at 3:34 PM

I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~3$ per 1M/tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...

➕ show 8 replies

avyeed_desa • today at 3:34 PM

I just bought a $25 chinese 2x Oculink card and two Minis Forum DEG1, had some spare PSUs lying around, and just installed two cards on each. It works. I saw that there is also a 4x Oculink card, but i don't know it that will work, too.

ComputerGuru • today at 3:07 PM

I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.

alt Hacker News

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Comments