Qwen 3.6 27B is the sweet spot for local development

496 points • by stared • today at 5:05 PM • 432 comments • view on HN

Comments

I love my MacBook Pro M5 128GB RAM and I love qwen3.6.

BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.

Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.

If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.

Thank me later.

➕ show 16 replies

mashygpig • today at 10:21 PM

It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison.

Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...

➕ show 1 reply

bensyverson • today at 5:38 PM

The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.

[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

➕ show 24 replies

onion2k • today at 5:34 PM

None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.

The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

➕ show 6 replies

blagui • today at 10:24 PM

How you can do dev in 2026 using 64k context and without sub agents?

The benchmark seemed fine until I saw that.

If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that as it will crash the t/s metrics on each prefill on top of the max 64k including input + output is a major blocker.

If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.

doodlesdev • today at 6:13 PM

I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?

(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)

➕ show 6 replies

zx76 • today at 7:41 PM

I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).

I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.

I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.

I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.

mips_avatar • today at 8:57 PM

I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.

cpburns2009 • today at 8:22 PM

Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.

ctkhn • today at 8:20 PM

I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.

XCSme • today at 9:42 PM

Considering the cloud version, all three models compared in the article (Qwen 3.6 35BA3b, 3.6 27B and DeepSeek V4 Flash), have very similar performance[0], BUT on cloud, for some reason DeepSeek V4 Flash is 10-20x cheaper than the Qwen models.

If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?

[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol

beastman82 • today at 5:47 PM

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.

QAT, MTP, 128k context.

I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

➕ show 3 replies

starefossen • today at 7:47 PM

We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace

0x0000000 • today at 5:33 PM

> ... on my Macbook Max M5 128 GB

Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?

➕ show 6 replies

SamInTheShell • today at 9:48 PM

This is probably the first small model I got through some simple web game tests without having to reset the context. It tends to opt to overwrite an entire file instead of doing edits... which editing is where most of these small models fall apart along with getting stuck in repeating loops. Only 24k tokens in so far, it did some decent newbie work.

ljosifov • today at 8:03 PM

Running 27B dense model on M5 128GB is ok, but one can do better.

On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.

27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.

➕ show 1 reply

max8539 • today at 10:15 PM

Running this model on a 48 GB memory MacBook Pro when offline, it performs its tasks, but of course, it’s slower than Claude or Codex.

pkroll • today at 8:33 PM

Since no one else posted it... I have open-webui pointed at a linux box with 128 gig of ram and an RTX Pro 6000, and after a couple of runs on trivia, had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."

72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.

That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)

mark_l_watson • today at 9:17 PM

I can come close to agreeing because queen-3.6-27b is my second favorite for local coding. I am using gemma4:26b-a4b-it-qat-48k (the "-48k" is from my modifying a model run with Ollama to always use a 48K context size). On a 32G Mac I use gemma4:26b-a4b-it-qat-48k and OpenCode and on my 16G MacBook Air I use gemma4:12b-it-qat-16k ("-16k" is my resizing context size) and little-coder. I break up projects into small libraries because local coding works better for me using small code bases.

I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.

To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.

➕ show 1 reply

marcuskaz • today at 8:47 PM

When is Amazon Bedrock going to get these newer models?

Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.

rhgraysonii • today at 5:28 PM

I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?

➕ show 2 replies

simplyluke • today at 9:11 PM

The open source models have gotten heavily conflated with local development. While that is cool and I'm excited about the future of local LLMs, it is not necessary to play around with these models. Without shilling for companies I don't have a relationship with, there are a number of companies who will give you an API just like Anthropic/OpenAI and you pay per token, albeit much cheaper than the frontier labs.

I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.

➕ show 1 reply

blopker • today at 6:10 PM

I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.

However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.

Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.

Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.

Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.

While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.

Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.

➕ show 1 reply

jjcm • today at 6:13 PM

I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.

Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong

RedCinnabar • today at 5:38 PM

Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.

➕ show 4 replies

kpw94 • today at 5:32 PM

> What it does:

> --jinja for tool calling support

Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year

cloudengineer94 • today at 9:52 PM

I'm using Qwen and Gemma 4 locally and it's pretty good stuff, not frontier level but gets the job done.

christoff12 • today at 9:05 PM

I just burned 20 minutes because I wanted to play hex minesweeper: https://hexabomb.pgpln.app

Source: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...

➕ show 1 reply

Otternonsenz • today at 6:16 PM

Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?

I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).

And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.

➕ show 5 replies

IronWolve • today at 6:43 PM

I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.

Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.

zedascouves • today at 8:25 PM

Just tried on some arduino code. after 10 minutes i got a list of improvements to my code.

I ran those throu opus saking if it was good advice and was not impressed:

I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already in your file, and its headline "critical" claim misreads what the code does. Going point by point:...

recursivedoubts • today at 9:09 PM

I would like to offer someone the next openclaw: a GUI for the mac that allows people to manage and install local models with a single click, provides GUI tools for tweaking important aspects of them, and also provides a good command line interface to those models.

➕ show 1 reply

jboss10 • today at 8:13 PM

I don't understand the talk about how expensive the hardware is. These models can run on very old or old and low end. I've been running Qwen3.6-35B Q4 on an old 1080 GPU(8GB vram) with 32GB sys RAM. I have a i7-12700.

It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.

I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.

These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.

➕ show 2 replies

MangoCoffee • today at 7:37 PM

Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.

➕ show 3 replies

hoppp • today at 9:34 PM

Its feasible but that laptop is not available for most devs.

I do have access for a 64 gb ram mac mini but most people don't.

diseasedyak • today at 7:43 PM

I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.

For anything else local, including writing some automation scripts and such, it works great.

➕ show 2 replies

hollowturtle • today at 9:06 PM

> Real work

Ok that's the part I'm interested in, don't care about minesweeper clones....

> Make a landing page selling candles for women that are into wellbeing and SPA.

can't be serious...

seemaze • today at 5:46 PM

I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.

https://pi-local-coding-bench.dev

➕ show 1 reply

HotGarbage • today at 5:22 PM

And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.

➕ show 1 reply

blueside • today at 7:02 PM

i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.

don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.

aand16 • today at 5:22 PM

I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!

➕ show 4 replies

cdnsteve • today at 7:59 PM

Checkout details on what this runs on for local AI here: https://tokenstead.ai/models/qwen3-6-27b

markdog12 • today at 6:14 PM

I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.

➕ show 1 reply

narrator • today at 7:28 PM

In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.

dom96 • today at 6:41 PM

What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.

I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.

zerolines • today at 8:54 PM

Yup, been rocking theQwen3.6-35B-A3B-MTP-GGUF locally with 88tk/s it's amazing.

alansaber • today at 8:23 PM

Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?

v3ss0n • today at 9:05 PM

3.5 122B is much better. 27 B is bad at Long context and Svelte

mbgerring • today at 5:55 PM

Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?

blobbers • today at 5:29 PM

How does llama.cpp use the GPU efficiently as opposed to MLX?

Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?

TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.

If I can generate voice at the same time as video, that would be useful.

➕ show 1 reply

alt Hacker News

Qwen 3.6 27B is the sweet spot for local development

Comments

🔗 View 20 more comments