I created "apfel" (https://github.com/Arthur-Ficial/apfel), a CLI for Apple's on-device local foundation model (Apple Intelligence). Yeah, it's super limited with its 4k context window and guardrails that trip super common false positives (just ask it to describe a color) ... but still ... using it in bash scripts that just work, without calling home or incurring extra costs, feels super powerful.
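As a sketch of what that kind of scripting could look like: this assumes `apfel` takes the prompt as a plain argument and prints the response to stdout, which is a guess about its interface, not something confirmed by the repo. Check the project's README for the real flags.

```shell
#!/usr/bin/env bash
# Hypothetical usage sketch: the apfel invocation below assumes a
# prompt-as-argument interface, which may not match the actual CLI.
log_tail=$(tail -n 20 /var/log/system.log)

# Everything stays on-device: no API key, no network egress, no per-token cost.
summary=$(apfel "Summarize these log lines in one sentence: $log_tail")
echo "$summary"
```

The appeal is exactly what the comment describes: the script works offline, and the only real constraint is fitting the input into the 4k context window.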
LLMs on device are the future. It's more secure, it solves the mismatch between inference demand and data-center supply, and it would use less electricity. It's just a matter of getting the performance good enough; most users don't need frontier-model performance.
Get turboquant 4-bit implemented and this would be a game changer.
Good to see Ollama is catching up with the times for inference on Mac. MLX-powered inference makes a big difference, especially on M5, as their graphs point out. What has really been a game changer for my workflow is https://omlx.ai/, which has SSD KV cold caching. I no longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed, and more time is spent on generation than waiting for a 50k+ context window to process.
Why are people still using Ollama? Seriously.
Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.
Already running Qwen 70B 4-bit on an M2 Max (96GB) through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple silicon. Curious to see how it compares on the bigger models vs the GGUF path.
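For anyone who hasn't tried the llama.cpp path being compared here: a minimal sketch of serving a local GGUF quant with llama.cpp's built-in OpenAI-compatible server. The model filename is a placeholder for whatever quant you've downloaded; `-ngl 99` offloads all layers to the GPU (Metal on Apple silicon).

```shell
# Serve a local GGUF quant; model path is a placeholder.
# -c sets the context window, -ngl offloads layers to the GPU.
llama-server -m ~/models/qwen-70b-q4_k_m.gguf -c 8192 -ngl 99 --port 8080

# Then query it like any OpenAI-style endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
```

This is the GGUF path the comment mentions; the interesting question is whether Ollama's native MLX backend beats it on memory behavior for the larger models.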
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
What are significant differences between Ollama and LM Studio now? I haven’t used Ollama because it was missing MLX when I started using LLM GUIs.
What would be a non-Mac computer that can run these models locally at the same performance profile? Are there any similar Linux ARM-based computers that can reach the same level?
> Please make sure you have a Mac with more than 32GB of unified memory.
Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16GB of RAM.
"We can run your dumbed down models faster":
> The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
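Those ratios translate into large absolute savings at the model sizes discussed in this thread. A back-of-envelope check (weights only, assuming 2 bytes per parameter for FP16) for a 70B-parameter model:

```shell
# Rough weight-memory estimate for a 70B model using the quoted 3.5x ratio.
awk 'BEGIN {
  params   = 70e9
  fp16_gb  = params * 2 / 1e9   # 2 bytes/param at FP16
  nvfp4_gb = fp16_gb / 3.5      # quoted 3.5x reduction
  printf "FP16: %.0f GB, NVFP4: ~%.0f GB\n", fp16_gb, nvfp4_gb
}'
# prints: FP16: 140 GB, NVFP4: ~40 GB
```

That is the difference between not fitting at all and fitting comfortably on a high-memory Mac, which is why people in this thread care about 4-bit formats landing.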
Finally! My local infra has been waiting for this for months!
What is the difference between Ollama, llama.cpp, ggml and gguf?
Really nice to see this!
This is huge for local-first AI, especially for privacy-sensitive workloads like memory systems.
*Why local memory inference matters:*
When building agents with long-term memory, "where does my data live?" becomes critical.
Even with E2EE (MemoryLake uses 3-party encryption so no single entity holds all keys), some users want memory extraction to happen entirely on-device.
*Architecture we're converging on:*
1. Local extraction (MLX/Ollama) – process docs/audio/video on-device
2. Encrypted sync – store structured memory in the cloud
3. Centralized orchestration – multi-hop reasoning across the memory graph

*Why hybrid:*
- Local: protects raw PII during extraction
- Cloud: enables cross-device access + powerful reasoning over the full memory graph
MLX makes sub-1B domain models practical for local memory extraction. We've tested MemoryLake-D1 quantized to 4-bit on M3 – still hits 98%+ accuracy at 40 tokens/sec.
The performance gap between x86 and Apple Silicon for this workload is dramatic (3-5x faster).
I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? Using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0, it can take anywhere between 6 and 25 seconds to respond (after lots of thinking) to me asking "Hello world". Is this the best that's currently achievable with my hardware, or is there something that can be configured to get better results?