Hacker News

A guide to local coding models

370 points | by mpweiher | yesterday at 8:55 PM | 187 comments

Comments

simonw | yesterday at 9:37 PM

> I realized I looked at this more from the angle of a hobbyist paying for these coding tools. Someone doing little side projects—not someone in a production setting. I did this because I see a lot of people signing up for $100/mo or $200/mo coding subscriptions for personal projects when they likely don't need to.

Are people really doing that?

If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex usage is charged at a much lower rate than Claude's.

The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.

Workaccount2 | yesterday at 10:09 PM

I'm curious what the mental calculus was that concluded a $5k laptop would benchmark competitively against SOTA models for the next 5 years.

Somewhat comically, the author seems to have made it about 2 days. Out of 1,825. I think the real story is the folly of fixating your eyes on shiny new hardware and searching for justifications. I'm too ashamed to admit how many times I've done that dance...

Local models are purely for fun, hobby, and extreme privacy paranoia. If you really want privacy beyond a ToS guarantee, just lease a server (I know they can still be spying on that, but it's a threshold.)

raw_anon_1111 | yesterday at 11:18 PM

I don’t think I’ve ever read an article where the reason I knew the author was completely wrong about all of their assumptions was that they admitted it themselves and left the bad assumptions in the article.

The above paragraph is meant to be a compliment.

But justifying it based on keeping his Mac for five years is crazy. At the rate things are moving, coding models are going to get so much better in a year that the gap will only widen.

Also, in the case of his father, who works for a company that must use a self-hosted model (or any other company with that requirement), would a $10K Mac Studio with 512GB of RAM be worth it? What about two Mac Studios connected over Thunderbolt using the newly released support in macOS 26?

https://news.ycombinator.com/item?id=46248644

simonw | yesterday at 9:41 PM

This story talks about MLX and Ollama but doesn't mention LM Studio - https://lmstudio.ai/

LM Studio can run both MLX and GGUF models but does so from an Ollama-style (but more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models
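
(As a side note for anyone wiring this into a coding tool: LM Studio can also serve the loaded model over an OpenAI-compatible local API, by default on port 1234. Below is a minimal sketch, assuming that local server is running and a model is already loaded; the model name is a placeholder for whatever you've downloaded.)

    # Minimal sketch: query a model served by LM Studio's local
    # OpenAI-compatible server (default http://localhost:1234/v1).
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:1234/v1",
        api_key="lm-studio",  # the local server doesn't check this; any non-empty string works
    )

    resp = client.chat.completions.create(
        model="qwen3-coder-30b",  # placeholder: use a model you've actually loaded
        messages=[{"role": "user", "content": "Write a shell one-liner that counts lines of Python in this repo."}],
    )
    print(resp.choices[0].message.content)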

NelsonMinar | yesterday at 10:05 PM

"This particular [80B] model is what I’m using with 128GB of RAM". The author then goes on to breezily suggest you try the 4B model instead of you only have 8GB of RAM. With no discussion of exactly what a hit in quality you'll be taking doing that.

cloudhead | yesterday at 9:31 PM

In my experience the latest models (Opus 4.5, GPT 5.2) are _just_ starting to keep up with the problems I'm throwing at them, and I really wish they did a better job, so I think we're still 1-2 years away from local models not wasting developer time outside of CRUD web apps.

tempodox | today at 7:50 AM

> You might need to install Node Package Manager for this.

How anyone in this day and age can still recommend this is beyond me.

andix | yesterday at 10:49 PM

I wouldn't run local models on the development PC. Instead, run them on a box in another room or at another location: less fan noise, and it won't affect the performance of the PC you're working on.

Latency is not an issue at all for LLMs, even a few hundred ms won't matter.

Running them directly on the development machine doesn't make a lot of sense to me, except when working offline while traveling.

maranas | yesterday at 10:02 PM

Cline or RooCode with VS Code already works really well with local models like qwen3-coder or even the latest gpt-oss. It's not as plug-and-play as Claude, but it gets you to a point where you only have to do the last 5% of the work yourself.
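
(Under the hood these extensions just point at a local server such as Ollama. Here's a minimal sketch of hitting that same server directly, assuming Ollama is running and a qwen3-coder build has been pulled; the exact model tag is a placeholder.)

    # Minimal sketch: call a locally pulled model through Ollama's REST API
    # (the same kind of local endpoint tools like Cline/RooCode can be pointed at).
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3-coder:30b",  # placeholder tag: use whatever you've pulled
            "prompt": "Write a Python function that parses an ISO 8601 date string.",
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,
    )
    print(resp.json()["response"])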

SamDc73 | yesterday at 11:37 PM

If privacy is your top priority, then sure spend a few grand on hardware and run everything locally.

Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription because I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they're usable, but clearly behind OpenAI, Claude, or Gemini for my workflow).

I got excited about aescoder-4b (a model that specializes in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind.

That said, I think there's real potential in small, highly specialized models, like a 4B model trained only for FastAPI, Tailwind, or a single framework. Until that actually exists and works well, I'm sticking with remote services.

amarant | today at 5:01 AM

Buying a maxed-out MacBook Pro seems like the most expensive way to go about getting the necessary compute. Apple is notorious for overcharging for hardware, especially RAM.

I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.

Getting a maxed-out non-Apple laptop will also be cheaper for comparable hardware, if portability is important to you.

nzeid | yesterday at 9:28 PM

I appreciate the author's modesty but the flip-flopping was a little confusing. If I'm not mistaken, the conclusion is that by "self-hosting" you save money in all cases, but you cripple performance in scenarios where you need to squeeze out the kind of quality that requires hardware that's impractical to cobble together at home or within a laptop.

I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.

mungoman2 | today at 5:50 AM

The money argument is IMHO not super strong here, as that Mac depreciates more per month than the subscription they want to avoid costs.

There may be other reasons to go local, but I would say that the proposed way is not cost effective.

There's also a fairly large risk that this hardware is sufficient now but will be too small before long, so there is a sizable financial risk built into this approach.

The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we use less capable models, even the $20/mo subscriptions won't hit their limits.

ineedasername | today at 12:11 AM

I've been using Qwen3 Coder 30B quantized down to IQ3_XXS to fit in under 16GB of VRAM. Blazing fast: 200+ tokens per second on a 4080. I don't ask it anything complicated, but one-off scripts to do something I'd normally have to do manually by hand or spend an hour writing myself? Absolutely.

These are no more than a few dozen lines I can easily eyeball and verify with confidence; that's done in under 60 seconds and leaves Claude Code with plenty of quota for significant tasks.
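
(For anyone wanting to reproduce that kind of quick-script workflow, here's a minimal sketch using llama-cpp-python with a quantized GGUF build fully offloaded to the GPU. The model path and quant are placeholders; pick whichever quant fits your VRAM.)

    # Minimal sketch of the one-off-script workflow described above.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3-coder-30b-a3b-instruct-IQ3_XXS.gguf",  # placeholder path/quant
        n_gpu_layers=-1,  # offload all layers to the GPU
        n_ctx=8192,       # a small context is plenty for short scripts
        verbose=False,
    )

    out = llm.create_chat_completion(
        messages=[{
            "role": "user",
            "content": "Write a Python script that renames every .JPG file in a folder to .jpg.",
        }],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])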

jszymborski | today at 4:19 AM

I just got an RTX 5090, so I thought I'd see what all the fuss was about with these AI coding tools. I've previously copy-pasted back and forth from Claude but never used the instruct models.

So I fired up Cline with gpt-oss-120b, asked it to tell me what a specific function does, and proceeded to watch it run `cat README.md` over and over again.

I'm sure it's better with the Qwen Coder models, but it was a pretty funny first look.

NumberCruncher | today at 6:42 AM

I freelance on the side and charge €100 per hour. Spending roughly €100 per month on AI subscriptions has a higher ROI for me personally than spending time reading this article and this thread. Sometimes we forget that time is money...

fny | today at 1:23 AM

My takeaway is that the clock is ticking on Claude, Codex et al.'s AI monopoly. If a local setup can do 90% of what Claude can do today, what do things look like in 5 years?

altx | today at 5:00 AM

It's interesting to notice that here (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...) we default to measuring LLM coding performance by how long a human task (~5h) a model can complete at a 50% success rate (falling back to 80% for the second chart, ~0.5h), while here it seems that for actual coding we really care about the last 90-100% of the costly model's performance.

threethirtytwo | yesterday at 11:44 PM

I hope hardware becomes so cheap that local models become the standard.

2001zhaozhao | today at 5:30 AM

At current prices, buying hardware just to run local models is not worth it, EVER, unless you already need the hardware for other reasons or you place a very high value on no one else possibly being able to see your AI usage.

Let's be generous and assume you are able to get an RTX 5090 at MSRP ($2000), ignore the rest of your hardware, and run a model that is the optimal size for the GPU. A 5090 has one of the best AI inference throughputs for the price, which benefits local AI cost-efficiency in our calculations. According to this Reddit post it outputs Qwen2.5-Coder 32B at 30.6 tokens/s: https://www.reddit.com/r/LocalLLaMA/comments/1ir3rsl/inferen...

It's probably quantized, but let's again be generous and assume it's not quantized any more than the models on OpenRouter. Also assume you are able to keep this GPU busy with useful work 24/7 and ignore your electricity bill. At 30.6 tokens/s you're able to generate about 965M output tokens in a year, which we can conveniently round up to a billion.

Currently the cheapest Qwen2.5-Coder 32B provider on OpenRouter that doesn't train on your input runs it at $0.06/M input and $0.15/M output tokens. So it would cost $150 to serve 1B tokens via API. Let's assume input costs are similar since providers have an incentive to price both input and output proportionately to cost, so $300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly.

Conclusion: even with EVERY assumption in favor of the local GPU user, it still takes almost 7 years for running a local LLM to become worth it. (This doesn't take into account that API prices will most likely decrease over time, but also doesn't take into account that you can sell your GPU after the breakeven period. I think these two effects should mostly cancel out.)

In the real world in OP's case, you aren't running your model 24/7 on your MacBook; it's quantized and less accurate than the one on OpenRouter; a MacBook costs more and runs AI models a lot slower than a 5090; and you do need to pay electricity bills. If you only change one assumption and run the model only 1.5 hours a day instead of 24/7, then the breakeven period already goes up to more than 100 years instead of 7 years.

Basically, unless you absolutely NEED a laptop this expensive for other reasons, don't ever do this.
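
(The arithmetic above, as a quick sketch you can tweak. All inputs are this comment's assumptions, with electricity and resale ignored as stated; 30.6 tokens/s works out to roughly 0.97B tokens per year.)

    # Back-of-the-envelope breakeven using the numbers from this comment.
    GPU_COST = 2000.0                     # RTX 5090 at MSRP, rest of the machine ignored
    TOKENS_PER_SEC = 30.6                 # Qwen2.5-Coder 32B throughput from the linked post
    API_COST_PER_YEAR_OF_TOKENS = 300.0   # ~$150 output + ~$150 input via OpenRouter

    SECONDS_PER_YEAR = 3600 * 24 * 365
    tokens_per_year = TOKENS_PER_SEC * SECONDS_PER_YEAR
    print(f"Tokens per year at 24/7: {tokens_per_year / 1e9:.2f}B")  # ~0.97B

    # Years until the GPU pays for itself if it runs flat out around the clock.
    print(f"Breakeven at 24/7: {GPU_COST / API_COST_PER_YEAR_OF_TOKENS:.1f} years")  # ~6.7

    # Same calculation if the GPU only does useful work 1.5 hours a day.
    utilization = 1.5 / 24
    print(f"Breakeven at 1.5 h/day: {GPU_COST / (API_COST_PER_YEAR_OF_TOKENS * utilization):.0f} years")  # ~107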

brainless | today at 4:48 AM

I do not spend $100/month. I pay for one Claude Pro subscription and then a (much cheaper) z.ai Coding Plan, which is like one-fifth the cost.

I use Claude for all my planning, create task documents, and hand them over to GLM 4.6. It has been my workhorse as a bootstrapped founder (building nocodo; think Lovable for AI agents).

Myrmornis | today at 4:47 AM

Can anyone give any tips for getting something that runs fairly fast under ollama? It doesn't have to be very intelligent.

When I tried gpt-oss and qwen using ollama on an M2 Mac, the main problem was that they were extremely slow. But I did have a need for a free local model.

ardme | yesterday at 10:44 PM

Isn't the math better if you buy Nvidia stock with what you'd pay for all the hardware and then just pay $20 a month for Codex out of the annual returns?

ikidd | today at 6:24 AM

So I can't see bothering with this when I pumped 260M tokens through, running in Auto mode on a $20/mo Cursor plan. It was my first month of a paid subscription, if that means anything. Maybe someone can explain how this works for them?

Frankly, I don't understand it at all, and I'm waiting for the other shoe to drop.

jollymonATX | today at 12:48 AM

This is not really a guide to local coding models, which is kinda disappointing. I would have been interested in a review of all the cutting-edge open-weight models in various applications.

m3kw9 | today at 6:17 AM

Nobody doing serious coding will use local models when frontier models are that much better, and no, they are not half a gen behind frontier. More like 2 gens.

freeone3000 | yesterday at 10:40 PM

What are you doing with these models that you're going above the free tier on Copilot?

Bukhmanizer | today at 12:18 AM

Are people really so naive as to think that the price/quality of proprietary models is going to stay the same forever? I would guess that sometime in the next 2-3 years all of the major AI companies are going to raise prices or enshittify their models to the point where running local models really is going to be worth it.

holyknight | yesterday at 10:34 PM

Your premise would've been right if memory prices hadn't skyrocketed by something like 400% in two weeks.

dackdel | today at 3:49 AM

No one using exo?

BoredPositron | today at 12:04 AM

Not worth it yet. I run a 6000 Blackwell for image and video generation, but local coding models just aren't on the same level as the closed ones.

I grabbed Gemini for $10/month during Black Friday, GPT for $15, and Claude for $20. Comes out to $45 total, and I never hit the limits since I toggle between the different models. Plus it has the benefit of not dumping too much money into one provider or hyper focusing on one model.

That said, as soon as an open weight model gets to the level of the closed ones we have now, I'll switch to local inference in a heartbeat.

artursapek | today at 3:06 AM

Imagine buying hardware that will be obsolete in 2 years instead of paying Anthropic $200 for $1000+ worth of tokens per month.
