StevenWaterman • today at 3:32 PM • 6 replies • view on HN

Yep, I daily drive Qwen3.6-27B (including for work), and have done pretty much since it came out. IMO it's the only small-ish local model worth using, if you can run it. It might not be as good as Opus at "add X large feature", but I don't want that from a model. I want to do the thinking while it does the typing, and Qwen3.6-27B is perfectly good at that (while in my experience models like the 35A3B and Gemma are significant downgrades).

Plus, I never have to worry about rate limits, quotas, or sitting in a queue during peak time. And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

Running on 2x 3090: 500-1000 tok/s prefill and 60 tok/s output at Q6_K_XL with MTP on llama.cpp, 220k-token context window (starts to get a bit dumb above 160k ish), no KV quantization.
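
To put those speeds in perspective, here is a back-of-the-envelope sketch of wall-clock time at the quoted rates. It assumes a cold start with no prompt caching and a steady rate throughout, which real runs won't match exactly; the 1,000-token reply length is a made-up number for illustration.

```python
def seconds_for(n_tokens: int, tokens_per_second: float) -> float:
    """Time to process n_tokens at a steady rate, ignoring any prompt caching."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    prompt_tokens = 160_000  # context size quoted above as the practical ceiling
    reply_tokens = 1_000     # hypothetical reply length, for illustration only

    # Prefill range quoted in the comment: 500 to 1000 tok/s.
    for prefill_rate in (500, 1000):
        minutes = seconds_for(prompt_tokens, prefill_rate) / 60
        print(f"prefill at {prefill_rate} tok/s: {minutes:.1f} min")

    # Output rate quoted in the comment: 60 tok/s.
    print(f"reply at 60 tok/s: {seconds_for(reply_tokens, 60):.0f} s")
```

In practice most turns only prefill the new tokens since the last request, so the cold-start figure is closer to a worst case than a typical wait.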

Replies

indoordin0saur • today at 3:47 PM

> And I can always see its full thoughts, don't have to worry about where my data is getting sent, and know it can't get secretly nerfed.

For this reason I wonder if local models are a potential business opportunity: sell engineering teams a pre-built, preconfigured GPU rig they can run in a closet. They wouldn't need to worry about any of the things you mentioned, and clients can rest assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.

giancarlostoro • today at 3:39 PM

> (starts to get a bit dumb above 160k ish)

If open models can ever hold roughly 600k-token windows, I'll be really excited. I've found that letting Claude read around 300-400k tokens of your codebase results in better outputs. I also have Claude read the official docs instead of "guessing" how to do something.

hughw • today at 4:05 PM

Just this morning I tweaked my single 3090 setup too:

  OLLAMA_FLASH_ATTENTION=1
  OLLAMA_KV_CACHE_TYPE=q8_0
  OLLAMA_CONTEXT_LENGTH=180000

and that fits in 23GB.

[edited for format]
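
For anyone wondering why `q8_0` matters here, a rough sketch of the KV cache arithmetic follows. The layer, head, and dimension numbers are hypothetical placeholders, not the real config of any Qwen model, so the absolute sizes mean nothing; the ratio between the two cache types is the part that carries over. The `q8_0` figure comes from ggml's block layout (32 int8 values plus one fp16 scale, so 34 bytes per 32 elements).

```python
# Rough KV-cache size estimate. All architecture numbers in the example
# config are hypothetical placeholders, NOT a real model's config.

# Bytes per cached element. ggml's q8_0 stores blocks of 32 int8 values
# plus one fp16 scale: 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size: one K and one V vector per attention layer per token."""
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Hypothetical architecture; only n_ctx comes from the settings above.
    cfg = dict(n_ctx=180_000, n_attn_layers=16, n_kv_heads=4, head_dim=256)
    for cache_type in ("f16", "q8_0"):
        gib = kv_cache_bytes(cache_type=cache_type, **cfg) / 2**30
        print(f"{cache_type}: {gib:.1f} GiB")
```

Whatever the real architecture is, `q8_0` comes out at 34/64 of the `f16` size, so a bit over half.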

iamtheworstdev • today at 4:57 PM

Are you running NVLink? I have the same setup but without NVLink, and it feels like I'm better off just splitting the 3090s to run separate models concurrently. But I also have no idea what I'm doing.

QuantumNoodle • today at 4:13 PM

Do you have any resources on the hardware needed to run models, and on tweaks? I see you mention 2x 3090, and I wanted to do more research on what hardware is satisfactory for which models.

Andrex • today at 5:33 PM

How long have you been using it?

alt Hacker News
