Hacker News

simonw · last Friday at 9:34 PM · 4 replies

I follow the MLX team on Twitter and they sometimes post about using MLX on two or more Macs joined together to run models that need more than 512GB of RAM.

A couple of examples:

Kimi K2 Thinking (1 trillion parameters): https://x.com/awnihannun/status/1986601104130646266

DeepSeek R1 (671B): https://x.com/awnihannun/status/1881915166922863045 - that one came with setup instructions in a Gist: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba...


Replies

awnihannun · last Friday at 10:23 PM

For a bit more context, those posts are using pipeline parallelism. For N machines, put the first L/N layers on machine 1, the next L/N layers on machine 2, and so on. With pipeline parallelism you don't get a speedup over one machine - it just buys you the ability to run models larger than what fits on a single machine.
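Concretely, the layer split looks something like the following self-contained Python sketch (my illustration, not MLX code). It simulates the N stages in a single process; on real hardware each "machine" would load only its own slice of layers and ship the activations to the next machine over the network.

    # Pipeline parallelism, simulated in one process: "machine" r owns
    # layers[start:end]; in the real setup the activations are sent over
    # the network from stage r to stage r+1 instead of staying local.
    def stage_bounds(num_layers, world_size, rank):
        start = rank * num_layers // world_size
        end = (rank + 1) * num_layers // world_size
        return start, end

    def pipelined_forward(layers, x, world_size):
        for rank in range(world_size):                  # in reality these run on N machines
            start, end = stage_bounds(len(layers), world_size, rank)
            for layer in layers[start:end]:             # only this slice lives on machine `rank`
                x = layer(x)
            # <- here machine `rank` would send x to machine `rank + 1`
        return x                                        # the last stage holds the output

    # Toy usage: 8 "layers" split across 4 "machines".
    layers = [lambda v, i=i: v + i for i in range(8)]
    print(pipelined_forward(layers, 0, world_size=4))   # -> 28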

The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to an N-times speedup for N machines. The main challenge is latency, since you have to communicate much more frequently.
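For contrast, here is roughly what sharding a single feed-forward layer across machines looks like, assuming the usual column/row weight split and MLX's existing mx.distributed.all_sum collective. It is a sketch of the idea, not the actual Tahoe 26.2 implementation.

    # Tensor parallelism sketch for one feed-forward block: every machine
    # holds a 1/N slice of each weight matrix and one all_sum per layer
    # combines the partial outputs (the latency-sensitive communication).
    import mlx.core as mx

    group = mx.distributed.init()        # falls back to a singleton group on one machine
    N = group.size()

    d_model, d_hidden = 8, 16
    w1_shard = mx.random.normal((d_model, d_hidden // N))   # split by output columns
    w2_shard = mx.random.normal((d_hidden // N, d_model))   # split by input rows

    def tp_ffn(x):
        h = mx.maximum(x @ w1_shard, 0)  # this machine's slice of the hidden activations
        partial = h @ w2_shard           # partial contribution to the layer output
        return mx.distributed.all_sum(partial)              # sum partials across machines

    x = mx.random.normal((1, d_model))
    print(tp_ffn(x).shape)               # (1, d_model), identical on every machine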

anemll · last Saturday at 3:51 AM

Tensor parallel test with RDMA last week: https://x.com/anemll/status/1996349871260107102

Note the fast sync workaround.

andy99 · last Friday at 10:25 PM

I’m hoping this isn’t as attractive as it sounds for non-hobbyists, because the performance won’t scale well to parallel workloads or even context processing, where parallelism can be put to better use.

Hopefully this makes it really nice for people who want to experiment with LLMs and run a local model, but means well-funded companies won’t have any reason to grab them all instead of GPUs.

CamperBob2 · last Saturday at 2:09 AM

Almost the most impressive thing about that is the power consumption. ~50 watts for both of them? Am I reading it wrong?
