I created "apfel" (https://github.com/Arthur-Ficial/apfel), a CLI for Apple's on-device local foundation model (Apple Intelligence). Yeah, it's super limited with its 4k context window and guardrails that trip super common false positives (just ask it to describe a color) ... but still ... using it in bash scripts that just work, without calling home or incurring extra costs, feels super powerful.
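As a sketch of what that kind of scripting could look like: this assumes `apfel` takes the prompt as a plain argument and prints the response to stdout, which is a guess about its interface, not something confirmed by the repo. Check the project's README for the real flags.

```shell
#!/usr/bin/env bash
# Hypothetical usage sketch: the apfel invocation below assumes a
# prompt-as-argument interface, which may not match the actual CLI.
log_tail=$(tail -n 20 /var/log/system.log)

# Everything stays on-device: no API key, no network egress, no per-token cost.
summary=$(apfel "Summarize these log lines in one sentence: $log_tail")
echo "$summary"
```

The appeal is exactly what the comment describes: the script works offline, and the only real constraint is fitting the input into the 4k context window.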
LLMs on device are the future. It's more secure, it solves the mismatch between inference demand and data-center supply, and it would use less electricity. It's just a matter of getting the performance good enough; most users don't need frontier-model performance.
Get turboquant 4-bit implemented and this would be a game changer.
Good to see Ollama is catching up with the times for inference on Mac. MLX-powered inference makes a big difference, especially on M5, as their graphs point out. What has really been a game changer for my workflow is https://omlx.ai/, which has SSD KV cold caching. I no longer have to worry about a session falling out of memory and needing to prefill again. Combine that with the M5 Max prefill speed, and more time is spent on generation than waiting for a 50k+ context window to process.
Why are people still using Ollama? Seriously.
Lemonade or even llama.cpp are much better optimised and arguably just as easy to use.
Already running Qwen 70B 4-bit on an M2 Max (96GB) through llama.cpp and it's pretty solid for day-to-day stuff. The MLX switch is interesting because Ollama was basically shelling out to llama.cpp on Mac before, so native MLX should mean better memory handling on Apple silicon. Curious to see how it compares on the bigger models vs the GGUF path.
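For anyone who hasn't tried the llama.cpp path being compared here: a minimal sketch of serving a local GGUF quant with llama.cpp's built-in OpenAI-compatible server. The model filename is a placeholder for whatever quant you've downloaded; `-ngl 99` offloads all layers to the GPU (Metal on Apple silicon).

```shell
# Serve a local GGUF quant; model path is a placeholder.
# -c sets the context window, -ngl offloads layers to the GPU.
llama-server -m ~/models/qwen-70b-q4_k_m.gguf -c 8192 -ngl 99 --port 8080

# Then query it like any OpenAI-style endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
```

This is the GGUF path the comment mentions; the interesting question is whether Ollama's native MLX backend beats it on memory behavior for the larger models.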
How does it compare to some of the newer MLX inference engines like optiq that support turboquantization? https://mlx-optiq.pages.dev/
What are significant differences between Ollama and LM Studio now? I haven’t used Ollama because it was missing MLX when I started using LLM GUIs.
What would be a non-Mac computer that can run these models locally at the same performance profile? Are there any similar Linux ARM-based computers that can reach the same level?
> Please make sure you have a Mac with more than 32GB of unified memory.
Yeah, I can still save money by buying a cheaper device with less RAM and just paying my PPQ.AI or OpenRouter.com fees.
Still waiting for the day I can comfortably run Claude Code with local LLMs on macOS with only 16GB of RAM.
"We can run your dumbed down models faster":
> The use of NVFP4 results in a 3.5x reduction in model memory footprint relative to FP16 and a 1.8x reduction compared to FP8, while maintaining model accuracy with less than 1% degradation on key language modeling tasks for some models.
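Those ratios translate into large absolute savings at the model sizes discussed in this thread. A back-of-envelope check (weights only, assuming 2 bytes per parameter for FP16) for a 70B-parameter model:

```shell
# Rough weight-memory estimate for a 70B model using the quoted 3.5x ratio.
awk 'BEGIN {
  params   = 70e9
  fp16_gb  = params * 2 / 1e9   # 2 bytes/param at FP16
  nvfp4_gb = fp16_gb / 3.5      # quoted 3.5x reduction
  printf "FP16: %.0f GB, NVFP4: ~%.0f GB\n", fp16_gb, nvfp4_gb
}'
# prints: FP16: 140 GB, NVFP4: ~40 GB
```

That is the difference between not fitting at all and fitting comfortably on a high-memory Mac, which is why people in this thread care about 4-bit formats landing.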
Finally! My local infra has been waiting for this for months!
What is the difference between Ollama, llama.cpp, ggml and gguf?
Really nice to see this!
This is huge for local-first AI, especially for privacy-sensitive workloads like memory systems.
*Why local memory inference matters:*
When building agents with long-term memory, "where does my data live?" becomes critical.
Even with E2EE (MemoryLake uses 3-party encryption so no single entity holds all keys), some users want memory extraction to happen entirely on-device.
*Architecture we're converging on:*
1. Local extraction (MLX/Ollama) – process docs/audio/video on-device
2. Encrypted sync – store structured memory in the cloud
3. Centralized orchestration – multi-hop reasoning across the memory graph

*Why hybrid:*
- Local: protects raw PII during extraction
- Cloud: enables cross-device access + powerful reasoning over the full memory graph
MLX makes sub-1B domain models practical for local memory extraction. We've tested MemoryLake-D1 quantized to 4-bit on M3 – still hits 98%+ accuracy at 40 tokens/sec.
The performance gap between x86 and Apple Silicon for this workload is dramatic (3-5x faster).
I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? Using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0, it can take anywhere between 6 and 25 seconds to respond (after lots of thinking) to me asking "Hello world". Is this the best that's currently achievable with my hardware, or is there something that can be configured to get better results?