"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."
Just want to note that for $3k you can get an M5 macbook pro with 48gb of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper if not significantly cheaper. It is awesome being able to run models locally though.
The cool thing about the 3090s is the RAM bandwidth. Token generation is mostly bottlenecked on memory bandwidth. Dual 3090s have 1.87 TB/s memory bandwidth (0.936 TB/s each), vs the M5 Macbook pro with only 0.3 TB/s (max chip has up to 0.63 TB/s but it's a $10k machine at that config).
This translates to qwen 27b actually working fast enough for useful work on dual 3090s and being painfully slow on Macbook Pros. Also if you're running a big model on a macbook pro the UI gets laggy and the keyboard gets hot. Much better to run dual 3090s in your basement and connect to them from your Macbook.
I have an M5 MacBook Pro and I also have a separate GPU setup for running models. The difference in speed is significant. It's not just token generation speed, but time to first token (prompt processing).
The M5 hardware is amazing for what it is, but GPUs are still so much faster.
Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.
I'm running Qwen3.6-27B on a single 24GB GPU at 80 tok/s, you don't even need 2 of them
That's a reasonable option, just be aware that you get about 1/3 as much memory bandwidth with the M5 Pro, or 2/3 with the M5 Max [now you're at $4100 for the lowest-end]. So both your prefill (flops-bound, M5 has a lot less) and decode (bw-bound) will be slower.
You can also buy a Jetson Orin with 64GB of unified memory.
The standalone mini/studio is better if you dont want to have a constantly hot laptop
Get a regular laptop and use the network to access the LLM
To summarize a video I saw recently [0] rebutting your arguments, MacBooks can get crazy slow when running local models or even just Claude Code and Codex due to their poor implementation, to the point that the laptop itself becomes unusable.
There are other arguments for running an ssh-able box in a closet somewhere too as with KVMs you can give an agent remote control over the machine itself such that it has vastly more capabilities than if it were controlling its own machine it's running on, as well as not needing to keep the MacBook open all the time just to have the agent finish running.
I’m an idiot who is unable to project itself in situations I’ve never experienced before.
So, I always thought local LLMs were toys not worth pursuing.
Only once have I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.
You stop fearing you are sharing sensitive information.
You stop fearing you will run out of tokens.
You stop fearing about the availability of the remote AI.
Local LLMs are extremely valuable.