They will be, and that moment is not that far off. We've got the progression in place already: first, large data centers could have performant LLMs, we are now firmly in "a bunch of servers with a couple of H100s each" territory, slowly going into "128 GB VRAM on a MacBook Pro or a Strix Halo". Within the next year, the pattern of "expensive remote LLM for planning, local slow-but-faster-than-human LLM for execution" will become the norm for companies, slowly moving to "using local LLM for everything is good enough". And then we'll have the equilibrium we already have with the "classic cloud": you either self-host or pay for flexibility and speed. The question will be: how much of the current compute capacity craze will local hosting give the kiss of death to and what that means for the market.
Local AI will catch up. Unless we can't get our hands on hardware anymore, which is a legitimate concern I have.
Yet there is another post a few rows down where people are losing their shit that Chrome has a local LLM model that uses a couple of GB of space for local-inference.
Damned if they do, damned if they don't.
People are trying to “make the best software”, though.
I think the Quixotic accelerationists of AI are more or less a vocal minority of the people who make software, and the choice of online APIs over local systems is largely a choice made for users, rather than developer’s laziness.
You can do more and better with private AI today than with local models. There is no getting around that. Even if local AIs get better, being on the cutting edge of LLM performance is often a very worthy investment.
Most people won’t settle for a product if it’s not the very best and incredibly convenient. That’s a high bar, and local AI often doesn’t meet those standards.
HN’s insistence on treating all users like they are open-source, privacy-first, self-hosted Linux fanatics is painfully corny.
Entrenched interests are going to do everything to stop local, but there's at least a few technical reasons to believe small and specialized models could be the norm eventually. If that does happen, local will follow.
TFA is focused on whether big models are necessary for what users want. There's some evidence they may never actually be reliable enough unless a) mechanistic interpretation matures far enough or b) our multi-agent systems all become multi-model.
In the first case, advancement in MI might fix problems with big models, but would also mean we can maybe get unified representations, and just slice and dice the useful stuff out of huge models, getting only what we need without the junk. Only want logic? Only vision? Just cut it it out of the big monster and enjoy reduced costs and surface area for problems.
For (b), just look at stuff like the evil vector, or the category of hallucinations specific to tool-use. Without a complete solution for helpful/honest/harmless alignment, it seems likely that creativity and rigor (and many other things) are fundamentally at odds. If you start to need many models for everything anyway, why do we need the huge expensive do-everything ones? So specialization also becomes a pressure to shrink everything towards minimal reliable experts
I've got some demos of what the new Prompt API in Chrome that uses a local model can do: https://adsm.dev/posts/prompt-api/#what-could-you-build-with...
As OP says, it shines in constrained environments where the model is transforming user-owned data. Definitely less useful for anything more open-ended.
I think we should separate the private AI discussion from the local AI discussion. The pragmatic choice to run big LLMs is one/several big servers online, but that doesn't mean private companies should be the only ones to run them.
A self hosted inference solution that offer good tenant isolation guarantees (ideally zero trust) and is easy enough to deploy and maintain (think Plex for AI) would be my choice for privacy. Now to be honest I have done zero research about this and have zero idea how feasible that is, maybe it already exists and there's some discord servers I should join?
Edit: I don't need to mention it here but what's incredible is that open models are in the ballpark of the best commercial models so supposedly, the hardest part by far is already solved.
Local LLMs is the only thing viable and probably the only thing it will remain once the hype dies down.
A smaller cheaper local model can delivery most the value for coding, while we still use some services for code review and security compliance.
Once the VC money runs out and they start to charge the real price, the C-level will have to impose budges or limits. The current pissing contest over who can expend the most tokens is both ridiculous and shortsighted
The example in the post confirms my theory that for local models to succeed they need to be "good enough", not big enough that they can compete with frontier models.
They need to be able to do a small task well and they need to be able to run reasonably on consumer-class devices. Even better if they can run on mobile phones.
In my experiments with local LLMs I noticed that while increasing the size of the model is nice the real thing that turns a barely useless model into something useful is the ability to use tools. Giving my models the ability to search the web and fetch web pages did way more to solve hallucinations than getting a bigger model. And it doesn't have a training cutoff. Sure, the bigger model is probably better at using tools but I often find the smaller models to be good enough.
There was never a better time to run LLMs locally. It's just a few commands from zero till a fully working LLM homelab.
``` harbor pull unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL
# Open WebUI -> llama.cpp + SearXNG for Web RAG + OpenTerminal as sandbox harbor up searxng webui llamacpp openterminal ```
That's it, it's already better than Claude's or ChatGPT's app.
My problem with LLMs (apart from philosophical aspects and economical impact) is that it would be unlikely for any of us to be able to train something functional locally (toy-like LLMs -- sure, but something really useful -- no). Apart from that it requires immense computing power, it also requires a dataset which is for the most part is obtained illegally.
> Use cloud models only when they’re genuinely necessary.
The problem is that it's much easier to use the SOTA models (especially if they are subsidized) instead of spending time fixing the knobs with the local one.
I just realized this with coding agents, yeah, you probably shouldn't always use latest version at xhigh, but you will end doing it because you do the job in less time, with less "effort" and basically at the same price.
I guess we'll see a real effort for local AI only when major vendors will start billing based on actual token usage.
A local Answer Machine is the dream, especially when the internet is decaying and generally on its last legs, but the hardware requirements seem like a huge mountain to climb. Things are progressing tremendously - deepseek v4 flash is very good for what it is - but even that goes beyond any reasonable local setup, which imo is 128 GB ram + 16 GB vram. 4 ram slots on a consumer board craters ram speed, 256 gb macs are too expensive, and even then the inference is ungodly slow.
On the other hand… v4 flash model is actual magic compared to what was available 2 years ago. If the rate of improvement stays as is, we’ll get a similar performance in a ~120B model in a year, which is viable (if expensive) for everyman hardware. Possibly you’ll be able to run its equivalent on a ~$1200 laptop by 2028, which for me-in-2020 would sound straight out of a scifi movie. A good harness that lets the model fetch data from other sources like a local wikipedia copy from kiwix could do a lot for factual knowledge, too; there’s only so much you can encode in the model itself, but even a cheapish (pre-curent prices) 2TB drive can hold an immense amount of LLM-accessible data.
Big caveat: I don’t see local models for programming or generally demanding agentic tasks being worth it anytime soon. You likely want bleeding edge models for it, and speed is far more important. Chat at 20tok/s is fine; working on even a small codebase at 20tok/s, especially on a noticeably weaker model, is just a waste of time. Maybe it’s a PEBKAC but I have no idea how people make any meaningful use out of qwen 3.6.
Question: for software development, how much of an AI do you need for local development? Can it be run locally? Can someone train something that knows a lot about software but lacks comprehensive coverage of history, politics, and popular culture?
I would like a standardized API for local AI to exist outside of the Apple ecosystem. The Prompt API is Chrome is halfway there.
* What is the answer to local AI for native apps on Windows?
* What is the answer to local AI for Linux?
This is a big opportunity for Linux, given the high quality of open-weight models. I hope some answer emerges before designs fracture and we get a dozen mutually incompatible answers.
I wish I could upvote this twice. We (devs) really REALLY need to consider on-device compute before going to the cloud for LLM inference.
It feels like we're one technological breakthrough away from all of these data centers going up to be deemed irrelevant.
Overall I'm bullish on standardized local APIs that ship with the browser or platform. Far more tractable than expecting end users to stand up their own local model instances, though r/LocalLLaMA is a fantastic community to follow if you want to go that route.
A useful framing over “local vs cloud AI” can be split along two axes: does the task touch private data, and does it need frontier intelligence? You can use frontier models for developing the software (doesn’t touch data), but open-source models running locally for ops: maintenance, debugging and monitoring (touches data). If you need to fall back to frontier intelligence at some point for a particularly hard to resolve problem, you can still rely on local models for pre-transforming and filtering input in a way that's privacy-preserving or satisfies some constraint before it’s sent off to the cloud for processing. OpenAI's privacy filter is a good example of a model that can be used to mask PII and secrets and that can run locally: https://openai.com/index/introducing-openai-privacy-filter/, before sending any data externally for processing.
Another framing for local vs frontier closed which the article mentions is whether the task saturates model capability. With certain tasks like PDF processing or voice or summarization, adding more intelligence isn't necessarily useful. Arguably we've approached that point for chat interfaces already with frontier open-source models. But for coding and ops through well structured tool use inside a coding capable harness, we're still a ways away.
Tangentially, a contrarian take here is that AI can actually enable more privacy preserving software if you’re so inclined. You can just build personalized software and it lowers the barrier to entry and the effort required to self host. SaaS complexity often comes from scaling and supporting features for all types of customers, and if you're building software for personal use, you don't need all that additional complexity. Additionally, foundational and infra software that is harder to vibecode with AI is often already open source.
While I agree that would be the goal, we are too early for that. Just like how speech recognition used to require many server in a Datacenter to process and you send your data over. It is now completely on devices.
We are at least 5 years away from that. And DRAM needs a substantial breakthrough in cost reduction.
I mostly agree, though I think local AI will need better UX around failure modes. Cloud models are often used not just because developers are lazy, but because they are more capable and easier to support consistently across devices.
> We are building applications that stop working the moment the server crashes or a credit card expires
Isn’t this true of any application that accesses anything not running on your computer? This is just describing what it means to add an API call to your app. Nothing to do with AI (?)
> One of the current trends in modern software is for developers to slap an API call to OpenAI or Anthropic for features within their app.
Well there’s your problem, control needs to go the other way. If you want your app to be AI-enabled, you need to make it easy for AI to control your app. Have you used OpenClaw? It’s awesome!
Here I was hoping that this was some plea for us to get away from proprietary solutions that we have no control over and go back to open source, but no, not that at all.
Consumer/private needs to be local.
Work? I don't want it local at all. I want it all cloud agent.
Agreed, but the way ram prices are going, I don't think we would be able to afford hardware that can run any useful model.
I would love for local inference to be possible, but from my experience, Kimi 2.6 is the only model that would be worth it, and its a $10k (M3 Ultra max spec'd - 30s TTFT so kind of slowish) to $30k (RTX6000/700GB+ DDR5) upfront, noise / power consumption aside.
agree with the article but the limitation for local llm usefulness is the limited scope from my experiments. eventually context heavy data pipelines require larger models which consumer hardware can't deal with yet. the local model for summary on a page like you describe could be done via code as well, i've found using an llm isn't always the right choice. for example i use ner tagging in my md docs for better indexing and llm search capabilities. this is purely code based and not via an llm. tried with an llm and the results were a lot worse. augmenting tools to make the llm produce better outputs gives better results.
>> years ago I launched "The Brutalist Report"
proceeds to brutalise the reader with an 88-point headline font.
The shitty thing here is, either everyone's shipping 800 MB at least with their binary, or, you have to rely on the platform vendor anyway. I'm hoping there's enough external pressure that the OS vendors turn it more into a repository than a blessed-model-garden.
Until the hardware is economical and powerful enough, local AI that can compete with frontier models today is still far off.
If we could even get something like GPT 5.5 running locally that would be quite useful.
Two issues -
1. Local models are likely to be more power-expensive to run (per-"unit-of-intelligence") than remote models, due to datacenter economies of scale. People do not like to engage with this point, but if you have environmental concerns about AI, this is a pretty important one.
2. Using dumb models for simple tasks seems like a good idea, but it ends up being pretty clear pretty quick that you just want the smartest model you can afford for absolutely every task.
"NO AI" needs to be the norm, we should be working on better ways of sharing information and better documentation instead of fighting with computers for substandard results.
If you don't need a lot of smarts, do you even need an LLM? Aren't older machine learning techniques just as good, or like, you know, old-school algorithms?
I wonder if a popularization moment for local AI will ultimately be the pin-prick that pops the AI bubble. Like the deepseek or openclaw moments but bigger/next.
I've been looking into options for this and we are getting close. There are two main constraints: memory and memory bandwidth.
NVidia segments the market by limiting the amount of memory on GPUs. It currently tops out at 32GB (on a 5090) but it has excellent memory bandwidth (~1.8TB/s). If you want more than the you need to buy an RTX Pro (eg RTX 6000 Pro w/ 96GB for ~$10K) or you get into high high end solutions like H100, H200, etc that have significantly more memory and even higher bandwidth on HBM memory (eg 3.2TB/s+).
NVidia has released the DGX Spark w/ 128GB of memory for ~$4k. The problem is the memory bandwidth. It's only 273GB/s, which is less than the M5 Pro (307GB/s) but more than the M5. You can buy a 16" Macbook Pro with an M5 Max and 128GB of memory for $6k and it has a bandwidth of 614GB/s. So the DGX Spark is a joke, really.
In case it wasn't clear, Apple is interesting in this space because it has a shared memory architecture so the GPU can use all the memory.
Many, myself include, expect there to be no refresh to the 5000 series consumer GPUs this year, which would otherwise happen based on product cycles. So no 5080 Super, for example. And I wouldn't expect a 6090 before 2028 realistically.
One thing Apple hasn't done yet is release the M5 Mac Studios, which are widely expected in Q3 this year. They are interesting because, for example, the M3 Ultra has a memory bandwidth of 819GB/s and previously had a max spec of 512GB but that got discontinued (and the 256GB version also got discontinued more recently).
So many expect an M5 Max Mac Studio with 1TB/s+ bandwidth and specs up to 256GB or 512GB, probably for ~$10k later this year.
You really have to use this hardware almost 24x7 for it to be economical because otherwise H100 computer hours are probably cheaper.
But what happens when the next generation of GPUs comes out to the trillions in AI DC investment? It's going to halve its value. That's over $1 trillion in capex that will disappear overnight, effectively.
I think Apple is the dark horse here because they have no interest in NVidia's psuedo-monopoly. I'm just waiting for them to realize it.
Now CUDA is an issue here still but I think as time goes on it's going to be less of an issue. Memory is still a huge constraint both in terms of price and just general supply because NVidia can justify paying way more for it than you can, probably.
It's still sad to see that 128GB (2x64GB) DDR5 kits are almost $2k now and werre $400 a year ago. Expect that to continue until this bubble pops (which IMHO it will) and we're likely in a global recession.
So the other issue is models. OpenAI and Anthropic are built on proprietary models. Their entire valuation depends on this moat. I don't think this last so both companies are doomed because open source models are going to be sufficiently good.
We can already do some reasonably cool stuff on local hardware that isn't that expensive and even more so once you get to $5-10k hardware. That's going to be so much better in 2 years that I'm hesitant to spend any amount of money now.
Plus the code for running these things is getting better. Just in the last month there have been huge speed ups in local LLMs with MTP.
We need computers with 128gb or maybe even 192gb of memory before local use make sense. From my own experience 32b LLMs are the absolute minimum for proper tool use and decent output quality. But for local ai you want also vision models and maybe even various LLMs. Plus some memory for the system of course. On my 36gb M3 the 24b Gemma model is nice. But the entire system gets allocated for that thing.
Same as local compute.
Welcome back to 2014. Let us now continue yelling at the cloud.
Depending on some remote AI provider is a major lock-in pitfall. But it's exactly what those AI providers want you to do.
I'm someone who is trying to build a subscription-based business to cover underlying LLM costs, and very hopeful I can one day just sell a permanent license to the software instead with customers using local LLMs to power it.
I guess Google got that memo!
Local AI is a bit like wind parks. Everyone is in favor, except if they are in your own backyard. There was recently a huge outcry when Chrome shipped a local 4 GB AI model: https://news.ycombinator.com/item?id=48019219
I have to conclude that people would like to have powerful local AI but it should at the same time only be a tiny model. In which case it wouldn't be powerful.
[flagged]
[flagged]
[dead]
[dead]
Local models are extraordinarily expensive if you're not maximizing throughput, and you're not going to be maximizing it.
Local models need to be resident in expensive RAM, the kind that has fat pipes to compute. And if you have a local app, how do you take a dependency on whatever random model is installed? Does it support your tool calling complexity? Does it have multimodal input? Does it support system messages in the middle of the conversation or not? Is it dumb enough to need reminders all the time?
Spend enough time building against local models and you'll see they're jagged in performance. You need to tune context size, trade off system message complexity with progressive disclosure. You simply can't rely on intelligence. A bunch of work goes into the harness.
Meanwhile, third party inference is getting the benefits of scale. You only need to rent a timeslice of memory and compute. It's consistent and everybody gets the same experience. And yes, it needs paying for, but the economics are just better.
For the mainstream audience, the sentiment around local ai today is the same that they had around open source a few decades ago. For a few products, some paid solutions were so much more advanced that open source were very often completely overlooked. Why bother ? And the like. Then we had captive SaaS and other plateforms and now it's obviously wrong for most of us.
The dependency we have with anthropic and openai for coding for instance is insane. Most accept it because either they don't care, or they just hope chinese will never stop open weights. The business model of open weights is very new, include some power play between countries and labs, and move an absurd amount of money without any concrete oversight from most people.
It's a very dangerous gamble. Today incredible value is available for nearly everyone. But it may stop without any warning, for reason outside our control.