SwellJoe • yesterday at 8:17 PM

I recently wrote up how I run local LLMs, because several folks had asked (https://swelljoe.com/post/how-i-run-local-llms/) and I think even my setup, which I spent maybe $4200 on, half on a Strix Halo and half on upgrades for my desktop, would be too expensive to justify today. I bought before prices went through the roof, and only did so because I like to tinker with hardware...not because I expected it to ever pay for itself vs. buying subsidized tokens from the big guys or the cheap tokens from efficient providers like DeepSeek.

Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy. This supply shortage has to end eventually, and for the difference in price once it does, I can buy billions of DeepSeek, MiMo, and GLM tokens and pay for $100 or $200 a month subscriptions from the big guys in the meantime. And you can't even run the full-sized GLM on that hardware: the model is quantized and so is your KV cache. The degradation is small, but not non-existent. You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.
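To put rough numbers on that comparison, here is a minimal sketch of the break-even arithmetic. The GPU price and count and the $100 and $200 monthly figures come from the comment above; the $5,000 for supporting hardware is an assumption standing in for "several thousand dollars", and the function name is made up for illustration. It ignores electricity, resale value, and any per-token spending on top of a subscription.

```python
def breakeven_months(hardware_cost, monthly_spend):
    """Months of hosted spending that add up to the up-front hardware cost."""
    if monthly_spend <= 0:
        raise ValueError("monthly_spend must be positive")
    return hardware_cost / monthly_spend


GPU_PRICE = 13_000           # per GPU, from the comment
GPU_COUNT = 4                # from the comment
SUPPORTING_HARDWARE = 5_000  # assumption: "several thousand dollars"

rig_cost = GPU_PRICE * GPU_COUNT + SUPPORTING_HARDWARE

for monthly in (100, 200):   # subscription tiers mentioned in the comment
    months = breakeven_months(rig_cost, monthly)
    print(f"${monthly}/month: {months:.0f} months ({months / 12:.1f} years)")
```

Under these assumptions the rig costs $57,000, so even the $200 tier takes a couple of decades to add up to the same amount.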

My recommendation for self-hosting is this: If you already have a 24GB or 32GB GPU, or two, or a recent Mac with 32GB or more, run the appropriate quantization of Qwen 3.6 27B or Gemma 4 31B. If your hardware is older and too slow for that, use the MoE, but know it'll be dumber. Use the tiny model for the stuff that doesn't need deep smarts: Research (give it a Brave or Exa MCP for web search), summarization, simple Python scripts for basic tasks, simple websites or web apps, categorization of stuff (I used Gemma 4 to review my past writing for friendliness and helpfulness), etc. It can also be a sub-agent for bigger agents (for those same kinds of tasks). Gemma 4 12B is an incredibly good model for its size, particularly for vision tasks, and in the 4-bit quantization (7GB on disk) it runs on anything, even a modern tablet or phone.
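For picking "the appropriate quantization", a rough rule of thumb is that the weights take about (parameter count × bits per weight ÷ 8) bytes. The sketch below applies that to the model sizes named above. It is only an estimate: real quantized files come out larger, since some tensors stay at higher precision and the file carries metadata (which fits the 7GB figure quoted for a 12B model at 4 bits), and the KV cache needs memory on top of the weights. The function names and the 4GB headroom are assumptions for illustration, not measured values.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: params * bits / 8 bytes, scaled to GB."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, memory_gb, headroom_gb=4.0):
    """True if the weights plus an assumed headroom (KV cache, runtime) fit."""
    return weight_size_gb(params_billion, bits_per_weight) + headroom_gb <= memory_gb


for params in (12, 27, 31):      # model sizes mentioned in the comment
    for bits in (4, 8):
        size = weight_size_gb(params, bits)
        verdict = "fits" if fits(params, bits, memory_gb=24) else "does not fit"
        print(f"{params}B at {bits}-bit: ~{size:.1f} GB of weights, {verdict} in 24 GB")
```

By this estimate a 27B or 31B model at 4 bits leaves room for context on a 24GB card, while the same models at 8 bits do not fit.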

And, if you don't already have a big GPU or unified memory Mac, just wait. Use the cheap tokens every AI company wants to sell you, for now. A Claude or Codex or Gemini subscription is a good deal. Tokens from DeepSeek are a good deal, especially with the Reasonix agent (which maximizes caching, which DeepSeek is uniquely good at, and cached tokens are uniquely cheap at DeepSeek). GLM is Good Enough and has a cheap coding plan. MiMo has the cheapest tokens in the game for a 1T+ model; DeepSeek and GLM are better models, but MiMo is fine.

When prices come down, I'll be speccing out a beast to run the big models, too. But, I'm not paying 4x for RAM and GPU and storage, and y'all shouldn't either. That's crazy. Computer prices go down over time. It is the law.

Replies

CamperBob2 • today at 12:00 AM

> Buying four $13000 GPUs and several thousand dollars worth of supporting hardware seems crazy

Especially when you realize you really want 8 of them. But...

> You're not running a model that's equal to what you get when you buy GLM tokens from Z.ai.

... to be perfectly clear: you have no earthly idea what you're getting when you buy GLM tokens from Z.ai. Your options are to run locally, rent cloud hardware, or hope for the best.
