Running local models is good now

649 points • by jfb • today at 2:36 PM • 308 comments

Comments

Exact reason I'm building csuite.so, do check it out and let me know if you need early access!

I've just hit a milestone on my project, moving away from AWS (for budget reasons) to self-hosted, and the local models are so much faster than in the past. Beyond LLMs, having embeddings plus image, video, and audio generation available is crazy.

Running locally is the bar; it's hard to make these things a service which scales.

malkosta • today at 4:32 PM

The problem with Qwen is that it just can't edit files reliably. I had to hack Pi all over to reduce the pain, but it's still far from perfect... does Gemma 4 struggle with this?

daniban • today at 4:13 PM

With Apple silicon and now the RTX Spark there are real discussions about whether local AI is the future. The only problem is Western open source models are so far behind. I genuinely feel there's a push to fix this. Gemma is getting more frequent releases and Nvidia is quietly creating very cool small models. I hope both the hardware and models catch up and local really does emerge.

ibizaman • today at 3:40 PM

Tangential, but reading on mobile, the font sizes in the code snippets are all over the place. I actually have the same issue on my blog. Anyone know why?

fl4regun • today at 4:21 PM

In my experience, with a system of 32GB RAM and 24GB VRAM, no, they aren't that good.

fg137 • today at 4:02 PM

> I have a 2022 M2 Mac with 64 GB RAM

I closed the article after that.

The author has no idea what a privilege it is to have a machine like that for personal use, and how 99% of the population are not going to afford a setup like that.

Just some back-of-the-envelope maths will tell you that a $20/month Claude subscription makes much more sense financially.

drchaim • today at 4:32 PM

I really want to try local models, but I don't have the hardware yet. I'm probably the only one here still using a 2020 Mac Mini M1 with 8 GB. :/

stared • today at 3:38 PM

I really recommend Qwen3.6 27B.

I ran some tests, and its 8-bit version runs at 30 tok/s using llama.cpp with MTP on a MacBook M5 Max. I have 128 GB, but 64 GB is plenty. https://github.com/stared/benching-local-llms-on-apple-silic...

On benchmarks, it is at more or less the level of mid-to-late 2025 SotA.
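
As a rough check on the memory figures above, here is a minimal sketch of the usual weights-only estimate: parameter count times bits per weight, divided by 8. The `overhead_gb` allowance for KV cache, runtime, and OS is a hypothetical placeholder, not a measured value, and real quantized files carry some extra metadata.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billions: float, bits_per_weight: float,
         ram_gb: float, overhead_gb: float = 8.0) -> bool:
    """True if weights plus an assumed overhead fit in the given memory.

    overhead_gb is a hypothetical allowance (KV cache, runtime, OS),
    not a number taken from the thread.
    """
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= ram_gb


if __name__ == "__main__":
    for bits in (4, 8, 16):
        print(f"27B at {bits}-bit: ~{weight_memory_gb(27, bits):.1f} GB of weights")
```

Under these assumptions a 27B model at 8 bits needs about 27 GB for weights, which is consistent with 64 GB being comfortable.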

xienze • today at 3:35 PM

The big caveat here is that these local models require you to invest some time tweaking your harness, AGENTS.md, and skills in order to get things roughly to the level you'd expect. But something like Qwen3.6-27B with web search capabilities and a good set of skills really is impressive! Especially considering that you can go wild and not worry about token costs.

The other thing that people tend to gloss over is that you really do need to spend some $$$ on decent hardware. Yeah, you CAN run some 4-bit quant with heavily quantized cache on your 16GB card, but it's not going to be a great experience (I think this is where a lot of the "if you think it's gonna be any good, you're going to be disappointed" stuff comes from). Yes, it's a lot of $$$ upfront, but it's very much unknown when hardware prices are going to come back to reality. There are a lot of hopes and dreams that any minute now an H100 will be worth pennies because "that's how it's always been" w.r.t. computer hardware, but we are living in interesting times. So you can't just make the tired old assumption that a Claude subscription over three years' time will work out to be dramatically less than the value of some card three years from now. We STILL have basically anything with >=24GB VRAM appreciating in value, which is absolutely wild. What I'm saying is, the depreciation curve may very well be a lot less dramatic and fast than it used to be, going forward.
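
The depreciation argument above can be put into numbers with a small sketch. All inputs are hypothetical except the three-year horizon from this comment and the $20/month figure mentioned elsewhere in the thread; the card price and resale fraction are illustrative, and the comparison ignores any difference in model quality.

```python
def subscription_cost(monthly_fee: float, months: int) -> float:
    """Total paid for a subscription over the period."""
    return monthly_fee * months


def hardware_net_cost(purchase_price: float, resale_fraction: float,
                      months: int, power_cost_per_month: float = 0.0) -> float:
    """Purchase price minus assumed resale value, plus electricity."""
    resale_value = purchase_price * resale_fraction
    return purchase_price - resale_value + power_cost_per_month * months


def break_even_resale_fraction(purchase_price: float, monthly_fee: float,
                               months: int, power_cost_per_month: float = 0.0) -> float:
    """Resale fraction at which owning costs the same as subscribing."""
    budget = subscription_cost(monthly_fee, months) - power_cost_per_month * months
    return 1 - budget / purchase_price


if __name__ == "__main__":
    price, fee, months = 2000.0, 20.0, 36  # price is a made-up example
    print("subscription:", subscription_cost(fee, months))
    print("hardware, 50% resale:", hardware_net_cost(price, 0.5, months))
    print("break-even resale fraction:", break_even_resale_fraction(price, fee, months))
```

The point of the sketch is that the answer hinges on the resale fraction: if cards hold their value, owning looks much better than the old depreciation assumptions suggest.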

wasimxyz • today at 3:47 PM

https://canirun.ai

ZionBoggan • today at 4:55 PM

This is actually a really insightful post!

jingw222 • today at 5:08 PM

open source must win

holoduke • today at 5:40 PM

Good? My MacBook M3 with 36 GB locked up after Gemma 4 filled all of its memory. A bit useful, yes, but it eats all resources. For local models to be useful we need at least 128 GB of system memory and 512 GB of video memory, plus 8 times the compute of a single 5090/H200.

monegator • today at 4:08 PM

I've been trying local models for the boring stuff you might be thinking about: writing small docs.

So I've tested a couple, and the speed is finally impressive. My colleague uses paid tiers of Claude and GPT, and the speed is comparable. Maybe even slightly faster on my end.

The problem is: I'm running the model on my work laptop, a 12th gen i5 with 16GB of RAM (which, you know, I asked to upgrade to 64, but that was right at the time of the great RAM shortage of the '20s), so I'm pretty limited in what I can use. And this is running alongside the usual suspects: web browser hogging 1.5GB, MPLABX hogging 3, Windows taking at least 5 just to sit idle, thermal throttled to 1GHz ... And yet its speed is comparable to a paid service. A lunch's worth of tokens vs a few cents of power.

So, what I found is that I need AT LEAST 16k of context window; otherwise the models halt when I pass in a small C file for analysis. And coding models will shit the bed with 4k. But we all know that: context size is King.
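
A quick way to guess whether a file will fit before sending it is to estimate its token count. This is only a sketch: the 4-characters-per-token ratio is a common rule of thumb for English text, not the actual tokenizer of any model named here, and C source may tokenize less efficiently. The `reserved_for_output` budget is likewise a hypothetical default.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_window: int,
                    reserved_for_output: int = 2048,
                    chars_per_token: float = 4.0) -> bool:
    """True if the prompt plus an assumed output/thinking budget fits."""
    needed = estimate_tokens(text, chars_per_token) + reserved_for_output
    return needed <= context_window


if __name__ == "__main__":
    source = "x" * 40_000  # stand-in for a ~40 KB C file
    for window in (4096, 16384):
        print(window, fits_in_context(source, window))
```

A thinking model can use far more than the reserved budget, so a file that passes this check can still run out of room.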

I found out that Qwen will keep looping while thinking, but that's not a surprise to you, either. But give it enough time and you will get a useful answer. I was hoping to use it as a better warning system for some languages, but I fear I need muuuch more context size, because I tried to feed it a file that had a function with an endless loop:

At 4k context it almost shit the bed if I gave it just the offending function, then told it where to look. At 16k context, with the whole file, it needed some guidance on what the problem was, and after 10-15 minutes of thinking it found the issue. Problem is, it kept second-guessing itself for another 20 minutes on the same unrelated thing before giving the output. The fix it proposed was wrong, but the semantics were correct. Good enough. Maybe it will be faster if I don't ask for a fix (which I didn't; I just asked it to look for a specific issue).

Wish I had 3 times the RAM so I could see what happens with more context.

Then I gave it the task of analyzing a C file to make an API document. It took half an hour, but then I had a good starting point, which I had to keep changing because it would confuse commands with IDs and things like that.

This was the Qwen 3.5 9B model.

I then tested Gemma 4, having been impressed by the tokens per second it gives on my Pixel 8a. Same tasks: same issues with short context; with long context it gave absolutely useless answers when looking at code, but it took 1/3 the time of Qwen.

In producing documentation, instead, it was much faster, and it never hallucinated data. Good. In 15 minutes I had everything done.

Not bad for stuff running on a business laptop, while doing actual work.

Tomorrow I will try Qwen 3.6; let's see how it goes.
