> Even with help from the "world's best" LLMs, things didn't go quite as smoothly as we had expected. They hallucinated steps, missed platform-specific quirks, and often left us worse off.
This shows how little training data for native app development even exists. People rarely write blog posts about designing native apps, long-winded Medium tutorials don't exist, and heck, even the number of open-source projects for native desktop apps is tiny compared to mobile and web apps. Historically, Microsoft paid some of the best technical writers in the world to write amazing books on how to code for Windows (see: Charles Petzold), but nowadays that entire industry is almost dead.
These types of holes in training data are going to be a larger and larger problem.
Although this is just representative of software engineering in general: few people want to write native desktop apps because it is a career dead end. Back in the 90s, knowing how to write Windows desktop apps was great - it was pretty much a promised middle-class lifestyle with a pretty large barrier to entry (C/C++ programming was hard, and the Windows APIs were not easy to learn, even though MS dumped tons of money into training programs) - but things have changed a lot. Outside of the OS vendors themselves (Microsoft, Apple) and a few legacy app teams (Adobe, Autodesk, etc.), very few jobs exist for writing desktop apps.
Great effort; a strong self-hosting community for LLMs is going to be as important as the FLOSS movement, imho. But right now I feel the bigger bottleneck is on the hardware side rather than the software side. The amount of fast RAM you need for decent models (80B+ params) is just not something that's commonly available in consumer hardware right now, not even gaming machines. I heard that Macs (minis) are great for the purpose, but you don't really get them with enough RAM at prices that still qualify as consumer-grade. I've seen people create home clusters (e.g. using Exo [0]), but I wouldn't really call it practical (single-digit tokens/sec for large models, and the price isn't exactly accessible either). Framework (the modular laptop company) has announced a desktop that can be configured with up to 128GB of unified RAM, but it's still going to come in at around $2-2.5k depending on your config.
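For a rough sense of scale (my own back-of-the-envelope numbers, not the commenter's): the weights alone dictate most of the RAM requirement, before you even count the KV cache.

```python
# Back-of-the-envelope memory needed just for the weights of a local model.
# Illustrative assumptions, not measurements.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(80, 16))  # ~160 GB at fp16 -- hopeless on consumer hardware
print(weight_memory_gb(80, 4))   # ~40 GB at 4-bit quantization -- plus KV cache and OS
                                 # overhead, which is why 64-128GB of fast unified RAM
                                 # is the practical target
```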
This is something that I think about quite a bit and am grateful for this write-up. The amount of friction to get privacy today is astounding.
I think I still prefer local, but I feel like that's because most AI inference is kinda slow, or at best comparable to local. But I recently tried out Cerebras (I have heard about Groq too), and honestly, when you try things at 1000 tokens/s or so, your mental model really shifts and you become quite impatient. Cerebras does say they don't log your data or anything in general, and you'll have to trust me that I'm not sponsored by them (wish I was, though); it's just that they're kinda nice.
But I still hope that we can someday have some meaningful improvements in local speed too. Diffusion-based models seem to be a really fast architecture.
It's the hardware more than the software that is the limiting factor at the moment, no? Hardware to run a good LLM locally starts around $2000 (e.g. Strix Halo / AI Max 395). I think a few Strix Halo iterations will make it considerably easier.
Super cool and well thought out!
I'm working on something similar focused on being able to easily jump between the two (cloud and fully local) using a Bring Your Own [API] Key model – all data/config/settings/prompts are fully stored locally and provider API calls are routed directly (never pass through our servers). Currently using mlc-llm for models & inference fully local in the browser (Qwen3-1.7b has been working great)
I'm a little confused about your product branding vs. blog post?
From the product homepage, I imagine you're running VMs in the cloud (a la Firecracker).
From the blog post though, it looks like you're running Apple-specific VMs for local execution?
As someone who's built the former, I'd love the latter for use with the new gpt-oss releases :)
How would this compare to using Apple Foundation Models which execute on device?
I’m all for this. This is the first effort I’ve seen attempting to solve the full stack - most local solutions I’ve seen look so DIY that I don’t have much hope I’ll be able to properly configure and operate them dependably.
I think there’s room for an integrated solution with all the features we’re used to from commercial solutions: Web search (most important to me), voice mode (very handy), image recognition (useful in some cases), the killer feature being RAG on personal files.
Open WebUI is a great alternative for a chat interface. You can point it at an OpenAI-compatible API like vLLM, or use the native Ollama integration. It has cool features like being able to say something like “generate code for an HTML and JavaScript pong game” and have it display the running code inline with the chat for testing.
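For anyone who hasn't wired this up before: both vLLM and Ollama expose OpenAI-compatible endpoints, so the stock `openai` client works against either. A minimal sketch (the base URL and model name are the usual defaults; adjust for your setup):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API at /v1 on port 11434 by default;
# vLLM does the same on port 8000. The key is ignored locally but must be non-empty.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",  # whatever model you have pulled locally
    messages=[{"role": "user", "content": "Generate an HTML and JavaScript pong game."}],
)
print(resp.choices[0].message.content)
```

Open WebUI can then be pointed at that same base URL in its connection settings.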
Any way to install this via just a container?
Similar to a `docker compose up -d` that a lot of projects offer. Just download the docker-compose.yml file into a folder, run the command, and you're running. If you want to delete everything, just `docker compose down` and delete the folder, and the container and everything is gone.
Anything similar to that? I don't want to run a random install.sh on my machine that does god knows what.
The link to assistant-ui in the article 404s. It should be https://github.com/assistant-ui/assistant-ui
Playing with local LLMs is indeed fun. I use Kasm workspaces[0] to run a desktop session with ollama running on the host. Gives me the isolation and lets me experiment with all manner of crazy things (I tried to make a computer-use AI but it wasn't very good)
I'm constantly tempted by the idealism of this experience, but when you factor in the performance of the models you have access to, and the cost of running them on-demand in a cloud, it's really just a fun hobby instead of a viable strategy to benefit your life.
As the hardware continues to iterate at a rapid pace, anything you pick up second-hand will still depreciate at that pace, making any real investment in hardware unjustifiable.
Coupled with the dramatically inferior performance of the weights you would be running in a local environment, it's just not worth it.
I expect this will change in the future, and am excited to invest in a local inference stack when the weights become available. Until then, you're idling a relatively expensive, rapidly depreciating asset.
Agree
I agree on this in every aspect
AI, or any technology that serves users locally, will eventually empower users in a big way, because users fully understand what they want.
Like paper and pencil, which were not "cheap" early in history but eventually became local. "AI", or any technology, will function the same way eventually.
Why?
1. Free to run and create (free == cheap, free == uncensored)
2. Ambient everywhere
You might want to check out what we built -> https://inference.sh It supports most major open-source/open-weight models: Wan 2.2 (video), Qwen Image, Flux, most LLMs, Hunyuan 3D, etc. It works in a containerized way locally, letting you bring your own GPU as an engine (fully free), or you can rent a remote GPU/pool from a common cloud if you want to run more complex models. For each model we tried to add quantized/GGUF versions, so even Wan 2.2/Qwen Image/Gemma become possible to run on GPUs with as little as 8GB of VRAM. MCP support is coming soon in our chat interface so it can access other apps from the ecosystem.
It's all about context and purpose, isn't it? For certain lightweight use cases, especially those concerning sensitive user data, a local implementation may make a lot of sense.
> LLMs: Ollama for local models (also private models for now)
Incidentally, I decided to try the Ollama macOS app yesterday, and the first thing it does upon launch is try to connect to some Google domain. Not very private.
Here is my rig, running GLM 4.5 Air. Very impressed by this model
You can get good models that run fine on an M1 32GB laptop just using the Ollama app.
Or if you want numerous features on top of your local LLMs, then Open WebUI would be my choice.
An LLM on your computer is a fun hobby; an LLM in your 10-person SME is a business idea. There are not enough resources on this topic at all, and the need is growing extremely fast. Local LLMs are needed for many use cases and businesses where the cloud is not an option.
If you ever end up taking this in the mobile direction, consider running on-device AI with Cactus: blazing-fast, cross-platform, and it supports nearly all recent open-source models.
In the same boat. I love running things on localhost. It's been great fun, and I learned tons I didn't know before. I know remote model APIs are a must for any serious work where a lot needs to get done. Still, it warms my heart every time llama-server spins up and serves from my aging MBP. Recent MoEs run great on Macs with loads of (V)RAM, and the power efficiency is scarcely believable.
Thanks for sharing. Note that the GitHub at the end of the article is not working…
I tried to port it to Docker and wrote a blog post here: https://shekhargulati.com/2025/08/09/making-coderunner-ui-wo.... I used Claude Code to do the port. We used the Datalayer Jupyter MCP Server instead of coderunner, which uses Apple containers.
To OP, your link for https://github.com/assistant-ui/assistant-ui does not work.
Yep, that's something I also actively experiment with in home projects: a local NAS (Synology) with 28TB of RAIDed storage, local containers and VMs on it, and local Gitea and other devops and productivity tools. All of that talks to my Mac, which handles editing, compiling, etc., plus LM Studio with a local agent. It's not always the best with AI (I lack enough RAM), but it's close enough to imagine how I will work in the future, end-to-end.
Local/edge is the most under-valued space at the moment: in aggregate, incredible computing power that dwarfs datacenters, with zero latency and zero marginal cost, and it's private, distributed, and resilient.
I’m trying to do something similar, but fine-tuning a model of choice on my specific local data source. For example, using existing code models to answer questions with code examples based on my private source files and documentation.
I tried doing it using Hugging Face and Unsloth but keep getting OOM errors.
Has anyone done this in a way that runs locally against your own data?
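Not a full answer, but the usual way around those OOMs is QLoRA-style training: load the base model in 4-bit and only train low-rank adapters, with gradient checkpointing and a small batch size. A rough sketch with the standard Hugging Face stack (the model name and hyperparameters are placeholders, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # placeholder; any code model you can fit

# 4-bit quantization keeps the frozen base weights small enough for one consumer GPU.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Enables gradient checkpointing and prepares the quantized model for training.
model = prepare_model_for_kbit_training(model)

# Train only small LoRA adapters instead of the full weights.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of parameters
```

From there, a short max sequence length and a per-device batch size of 1 (with gradient accumulation) usually gets training to fit; if it still OOMs, the base model is simply too big for the card.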
Yeah, in an ideal world there would be a legal construct around AI agents in the cloud acting on your behalf that could not be blocked by various stakeholders just because they don't like what you are doing, even if it's totally legal. Things that would be considered fair use, or that are merely annoying to certain companies, should not be easy for those companies to wholesale block by leveraging business relationships. Barring that, then yeah, a local AI setup is the way to go.
That is fairly cool. I was talking about this on X yesterday. Another angle, however: I use a local web scraper plus Meilisearch as a search engine over the main tech web sites I am interested in. For my personal research I use three web search APIs, but there is some latency. Having a big chunk of the web that I am interested in available locally, with close to zero latency, is nice when running local models, my own MCP services that might need web search, etc.
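For anyone wanting to reproduce that kind of setup, the Meilisearch side is pretty small; a minimal local index-and-search loop looks roughly like this (the index name and documents are made up, and it assumes Meilisearch is already running on its default port):

```python
import meilisearch

# Meilisearch listens on :7700 by default; the key is whatever you started the server with.
client = meilisearch.Client("http://localhost:7700", "masterKey")
index = client.index("tech_articles")

# Documents produced by the local scraper, stored entirely on your own machine.
index.add_documents([
    {"id": 1, "url": "https://example.com/post", "title": "Running local LLMs", "body": "..."},
])

# Near-zero-latency search that local models or MCP servers can call.
results = index.search("local llm inference")
for hit in results["hits"]:
    print(hit["title"], hit["url"])
```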
It would be nice to have something more modest like a local offline foreign language translator.
Basically I'd like to be able to have an emacs "M-x translate-french-to-english" function. This should be easier than a full chat app but doesn't exist as far as I know.
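This is actually pretty close to existing today: the Helsinki-NLP OPUS-MT models are small, run offline on CPU, and handle exactly one language pair each. A sketch of a tiny script (the filename and wiring are hypothetical) that an emacs command could shell out to:

```python
#!/usr/bin/env python3
# translate_fr_en.py -- offline French-to-English translation on CPU.
# Reads French text on stdin, prints English on stdout.
import sys
from transformers import pipeline

# A few hundred MB, downloaded once and cached; fully offline afterwards.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

text = sys.stdin.read()
print(translator(text, max_length=512)[0]["translation_text"])
```

An `M-x translate-french-to-english` would then just be a `shell-command-on-region` wrapper around that script.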
Local AI is awesome, but without beefy hardware it’s like trying to run a marathon in flip-flops.
Halfway through he gives up and uses remote models. The basic premise here is false.
Also, the term “remote code execution” in the beginning is misused. Ironically, remote code execution refers to execution of code locally - by a remote attacker. Claude Code does in fact have that, but I’m not sure if that’s what they’re referring to.
Half-OT: Anything useful that runs reasonably fast on a regular Intel CPU/GPU?
On a similar vibe, we developed app.czero.cc to run an LLM inside your Chrome browser on your own hardware without installation (you do have to download the models). It's hard to run big models, but it doesn't get more local than that without having to install anything.
Self hosted and offline AI systems would be great for privacy but the hardware and electricity cost are much too high for most users. I am hoping for a P2P decentralized solution that runs on distributed hardware not controlled by a single corporation.
Infra notwithstanding - I'd be interested in hearing how much success they actually had using a locally hosted MCP-capable LLM (and which ones in particular) because the E2E tests in the article seem to be against remote models like Claude.
https://github.com/adsharma/ask-me-anything
Supports MLX on Apple silicon. Electron app.
There is a CI to build downloadable binaries. Looking to make a v0.1 release.
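For anyone curious what the MLX path looks like under the hood, generic `mlx-lm` usage on Apple silicon is roughly this (a sketch, not the ask-me-anything internals; the model name is just an example from the mlx-community hub):

```python
from mlx_lm import load, generate

# Loads a 4-bit quantized model straight into unified memory on Apple silicon.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Explain unified memory in one paragraph."
print(generate(model, tokenizer, prompt=prompt, max_tokens=200))
```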
A PC with an RTX 3090 is able to run many models locally at decent speed. Or an RTX 4090, though it's more expensive (and power hungry).
Then using Ollama is not the right choice.
I built TxtAI with this philosophy in mind: https://github.com/neuml/txtai
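For context, that philosophy in practice means an embeddings database that runs entirely on your machine; a toy example in the style of the project's README (my own sample data):

```python
from txtai import Embeddings

# Build a local semantic index over a few documents.
embeddings = Embeddings(content=True)
embeddings.index([
    "Ollama serves local models over an HTTP API",
    "Meilisearch is a fast local full-text search engine",
    "LoRA adapters make fine-tuning feasible on a single GPU",
])

# Semantic search, no cloud calls involved.
print(embeddings.search("how do I run models on my own machine", 1))
```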
That's my vision; I hope it can help. I think that if we combine all our personal data and organize it effectively, we can be 10 times more efficient. Long-term AI memory: everything you say and see gets privately loaded into your own personal AI, and that can solve many difficulties, I think. https://x.com/YichuanM/status/1953886817906045211
At least you won't be needing a heater for the winter
We have this in closed alpha right now at ThinkAgents.ai and are getting ready to roll it out to our most active builders in the coming weeks.
What Apple hardware is being used here? I see Apple Silicon mentioned but not the configuration. What did I miss?
I get it but I can’t get over the irony that you are using a tool that only works precisely because people don’t do this.
This is fantastic work. The focus on a local, sandboxed execution layer is a huge piece of the puzzle for a private AI workspace. The `coderunner` tool looks incredibly useful.
A complementary challenge is the knowledge layer: making the AI aware of your personal data (emails, notes, files) via RAG. As soon as you try this on a large scale, storage becomes a massive bottleneck. A vector database for years of emails can easily exceed 50GB.
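To make that concrete, the raw vectors alone get you into that range quickly (illustrative chunk counts and embedding size, not LEANN's numbers):

```python
# Storage for a naive float32 vector index over a personal mail archive.
chunks = 8_000_000          # e.g. years of email split into small overlapping chunks
dim = 1536                  # a common embedding dimension
bytes_per_vector = dim * 4  # float32

print(f"{chunks * bytes_per_vector / 1e9:.0f} GB just for the vectors")
# ~49 GB, before any graph/index overhead or the text itself
```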
(Full disclosure: I'm part of the team at Berkeley that tackled this). We built LEANN, a vector index that cuts storage by ~97% by not storing the embeddings at all. It makes indexing your entire digital life locally actually feasible.
Combining a local execution engine like this with a hyper-efficient knowledge index like LEANN feels like the real path to a true "local Jarvis."
Code: https://github.com/yichuan-w/LEANN
Paper: https://arxiv.org/abs/2405.08051