I wonder what motivates Apple to release features like RDMA, which are mostly useful for server clusters, while ignoring basic QoL stuff like remote management or rack-mount hardware. It's difficult to see it as a cohesive strategy.
Makes one wonder what Apple uses for their own servers. Maybe they have some internal M-series server product they just haven't bothered to release to the public, and features like this are downstream of that?
Hey Jeff, wherever you are: this is awesome work! I’ve wanted to try something like this for a while and was very excited for the RDMA over thunderbolt news.
But I mostly want to say thanks for everything you do. Your good vibes are deeply appreciated and you are an inspiration.
What is the max token throughput when batching? Lots of agentic workflows (not just vibe coding) run many inferences in parallel.
It seems like every time someone does an AI hardware “review” we end up with figures for just a single instance, which simply isn't how the target demographic for a $40k cluster is going to use it.
Jeff, I love reading your reviews, but can’t help but feel this was a wasted opportunity for some serious benchmarking of LLM performance.
A good part of humanity's knowledge under your desk, running on a few old light bulbs' worth of power.
Linux already has RDMA support but it cannot yet use Thunderbolt. It's probably quite a bit of work to add everything that's required. Is anyone working on it?
It would be great to have this for those cheap Strix Halo boxes with 128GB quad channel DDR5-8000 for using two or three of them with their 2 USB4 ports (which are Thunderbolt capable) to fit larger models.
I was impressed by the lack of dominance of Thunderbolt:
"Next I tested llama.cpp running AI models over 2.5 gigabit Ethernet versus Thunderbolt 5"
Results from that graph showed only a ~10% benefit from TB5 vs. Ethernet.
Note: the M3 Studios have built-in 10 Gbps Ethernet, but that wasn't tested; the comparison used 2.5 Gbps Ethernet.
If 2.5G Ethernet was only 10% slower than TB5, how would 10G Ethernet have fared?
Also, TB5 has to be wired so that every Mac is connected to every other Mac, limiting you to 4 machines.
By comparison, with Ethernet you could use a hub-and-spoke configuration with an Ethernet switch, theoretically letting you use more than 4 machines.
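To put my own back-of-envelope numbers on the wiring: a full mesh of n machines needs n(n-1)/2 cables and uses n-1 ports per machine, which is why the mesh gets impractical fast:

```shell
# Full-mesh wiring cost: every pair of machines gets a direct
# Thunderbolt link, so n*(n-1)/2 cables total and n-1 ports
# consumed on each machine.
for n in 2 3 4 5; do
  echo "$n machines: $(( n * (n - 1) / 2 )) cables, $(( n - 1 )) ports each"
done
```

A switch collapses that to n cables and one port per machine, which is the hub-and-spoke advantage.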
Anyone else getting ERR_TOO_MANY_REDIRECTS trying to access the post?
I'd be interested in seeing numbers that split out the speed of reading input (aka prefill) and the speed of generating output (aka decode). Those numbers are usually different and I remember from this Exo article that they could be quite radically different on Mac hardware: https://blog.exolabs.net/nvidia-dgx-spark/
The "all nodes connecting to all other nodes" setup reminds me of NUMALink, the interconnect that SGI used on many (most? all?) of their supercomputers. In an ideal configuration, each 4-socket node has two NUMALink connections to every other node. As Jeff says, it's a ton of cables, and you don't have to think of framing or congestion in the same way as with RDMA over Ethernet.
The largest nodes in his cluster each have 512GB RAM. DeepSeek V3.1 is a 671B parameter model whose weights take up 700GB RAM: https://huggingface.co/deepseek-ai/DeepSeek-V3.1
I would have expected that going from one node (which can't hold the weights in RAM) to two nodes would have increased inference speed by more than the measured 32% (21.1t/s -> 27.8t/s).
With no constraint on RAM (4 nodes) the inference speed is less than 50% faster than with only 512GB.
Am I missing something?
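Double-checking the arithmetic on the figures quoted above (my own sanity check, not from the article):

```shell
# Speedup implied by the numbers in the thread: 21.1 t/s on one
# node vs. 27.8 t/s on two. awk is used just for floating point.
awk 'BEGIN {
  one = 21.1; two = 27.8
  printf "2-node speedup: %.1f%%\n", (two / one - 1) * 100
}'
```

As I understand it, llama.cpp's RPC backend splits *layers* across hosts (pipeline style), so each token still traverses every layer in sequence; extra nodes mostly add RAM to hold the weights rather than multiplying per-token compute, which would make a ~32% bump unsurprising rather than mysterious.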
As Jeff states, there are really no Thunderbolt switches, which currently limits the size of the cluster.
But would it be possible to use RoCE with these boxes rather than RDMA over Thunderbolt? And what would the expected performance be? As I understand it, RDMA should be 7-10 times faster than going via TCP. And if I understand correctly, RoCE is RDMA over Converged Ethernet, so it runs over Ethernet frames at a lower layer rather than over TCP.
10G Thunderbolt adapters are fairly common, but you can also find 40G and 80G Thunderbolt Ethernet adapters from Atto. Probably not cheap, but it would be fun to test! Even if the bandwidth is there, though, we might get killed by latency.
Imagine this hardware with a PCIe slot. The InfiniBand hardware is there; then we "just" need the driver.
Really cool article, I liked these details that weren't exactly related to the thesis:
- the mysterious disappearance of Exo
- Jeff wants something like SMB Direct but for the Mac. Wait what? SMB Direct is a thing, wha?? I always thought networked storage was untrustworthy.
- A single M3 Ultra is fast for inference
- A Framework Desktop with the AI Max 395 is only $2100
Now I have some more rabbit holes to jump down.
There have been a couple of videos/posts about this from other influencers today.
Does anyone remember a guy here posting about linking Mac Studios with Thunderbolt for HPC/clustering? I wasn't able to find it with a quick search.
Edit: I think it was this?
In an ideal world, Apple would have released a Mac Pro with card slots for doing this kind of stuff.
Instead we get gimmicks over Thunderbolt.
> You have to click buttons in the UI.
I like doing development work on a Mac, but this has to be my biggest bugbear with the system.
The next Mac Studio is going to be a top seller. I don't think people want to drop $10k on a few M3s, but I think they will for the M6. Just hoping the DRAM shortage doesn't ruin this plan.
> For example: did you know there's no way to run a system upgrade (like to 26.2) via SSH
I did not know this. I thought the `softwareupdate` command was built for this use case and worked over ssh. It sure looks like it should, but I don't have a Mac I can try it on right now.
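For reference, the documented CLI does at least cover ordinary updates over ssh; whether it handles a full version upgrade like 26.2 is exactly what I can't verify, so treat this as an untested sketch:

```shell
# See what updates the machine thinks are available
softwareupdate --list

# Install everything available and reboot when finished (needs sudo)
sudo softwareupdate --install --all --restart
```

On Apple Silicon, `softwareupdate` reportedly also needs volume-owner credentials (the `--user`/`--stdinpass` flags) when run non-interactively, which may be the headless sticking point the article ran into.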
Very cool. I'm probably overthinking this, but why are they seemingly hyping it now (I've seen a bunch of this recently) with no M5 Max/Ultra machines in sight? Is it because their release is imminent (I have heard Q1 2026), or is it to stretch out demand for the M4 Max / M3 Ultra? I plan to buy one (not four), but I'd feel like I'm buying something that will be immediately out of date if I don't wait for the M5.
I wonder if there's any possibility that an RDMA expansion device could exist in the future - i.e. a box full of RAM on the other end of a thunderbolt cable. Although I guess such a device would cost almost as much as a mac mini in any case...
> Working with some of these huge models, I can see how AI has some use, especially if it's under my own local control. But it'll be a long time before I put much trust in what I get out of it—I treat it like I do Wikipedia. Maybe good for a jumping-off point, but don't ever let AI replace your ability to think critically!
It is a little sad that they gave someone an uber machine and this was the best he could come up with.
Question answering is interesting but not the most interesting thing one can do, especially with a home rig.
The realm of the possible:
- Video generation: CogVideoX at full resolution, longer clips; Mochi or Hunyuan Video with extended duration
- Image generation at scale: FLUX batch generation, 50 images simultaneously
- Fine-tuning: actually train something; show LoRA on a 400B model, or a full fine-tune of a 70B
But I suppose "you have it for the weekend" means chatbot go brrrrr and snark.
Wonder if support for RDMA will translate into support for things such as SMB Direct or if it's really only useful for RAM pooling
Any thoughts on the GB300 workstation with 768GB RAM (from NVIDIA, Asus, Dell, ...)? Although many announcements were made, it seems not to be available yet. It does have faster interconnects, but it will probably be much more expensive.
Wonder if RDMA support can translate to things like SMB direct or other RDMA adjacent things
rdma_ctl enable in 1tn parameter.
TL1 mount, where 1.5 TB allocate mac-mini server.
As much as I hate Apple's attitude towards hackers and modifying systems, I have to commend them for building awesome features like this.
> That's definitely fast enough for vibe coding, if that's your thing, but it's not mine.
Why even…?
BUILD AI has a post about this, in particular on sharding the KV cache across GPUs and how the network is the new memory hierarchy:
https://buildai.substack.com/p/kv-cache-sharding-and-distrib...
https://m.youtube.com/watch?v=4l4UWZGxvoc
Seems like the ecosystem is rapidly evolving
I really hope AMD or Intel can get on the clue train and respond.
Intel in particular has had half a decade of extremely good Thunderbolt ports built into their mobile chips (alas, not present on desktop chips, for shame). There's been not-bad-but-not-great Thunderbolt host-to-host networking that TCP can run over, but system-to-system connectivity has been a total afterthought, not at all tuned for obvious, readily available options like RDMA here. Yet nothing stops anyone from building better host-to-host protocols.
There are also so many good next steps competitors could take. CXL is showing up on server systems as a much lower-latency transport that is PCIe PHY-compatible but lighter weight. Adding that to consumer chips and giving even a third of a shit could blow what we see here out of the water. It could probably be done over USB4 and radically outclass this bespoke RDMA capability.
Connectivity has been a bespoke special capability for too long. Intel did something amazing with Xeon's integrated 100Gb OmniPath a long time ago, for barely any extra money. But the market didn't reward them for it, and everyone gave up on connecting chips together. Today we are hostage to fantastically expensive, inefficient NICs that cost a ton of money to do a worse job, paying an enormous penalty for not having the capability on-chip, with at best an ASMedia I/O hub doing the USB4 dance one hop away from the CPU.
I really hope Intel can appreciate how good they were and see the threat of Apple kicking ass here, doing what Intel has uniquely been offering for half a decade with its on-chip Thunderbolt (limited, alas, to mobile chips). I hope AMD feels the heat and gets some goddamned religion: they delivered so strongly on PCIe lane counts, but they have been slacking on I/O capabilities for so long, especially on consumer platforms. Apple, meanwhile, is using both its awesome on-chip memory and its exceptional willingness to care even the tiniest bit about the consumer interconnect that already exists in hardware.
I really, really hope someone other than Apple can ante up and care. There are so many wins to be had, so close. These companies feel so distracted from the plot. Fucking shame. Good on Apple for being the only ones to actually seize the obvious that was just sitting there; it took little extra effort or innovation. What a shame no other players are trying at all.
It's easy to find Intel motherboards that can take 2TB of RAM, for example: https://www.supermicro.com/en/products/motherboard/x14sbw-tf
This seems suboptimal.
Wow. $40k for a friendly chat(bot)...
Hey, at least this post allows us to feel as though we spent the money ourselves.
Bravo!
My expectations from M5 Max/Ultra devices:
- Something like the DGX QSFP link (200Gb/s, 400Gb/s) instead of TB5. Otherwise the economics of this RDMA setup, while impressive, don't make sense.
- Neural accelerators to get prompt prefill time down. I don't expect RTX 6000 Pro speeds, but something like 3090/4090 would be nice.
- 1TB of unified memory in the maxed out version of Mac Studio. I'd rather invest in more RAM than more devices (centralized will always be faster than distributed).
- 1TB/s+ memory bandwidth. For the past 3 generations, the speed has been stuck at 800GB/s...
- The ability to overclock the system? I know it probably will never happen, but my expectations for a Mac Studio are not the same as for a laptop, and I'm TOTALLY okay with it drawing 600W+. Currently it's capped at ~250W.
Also, as the OP noted, this setup can support up to 4 Mac devices because each Mac must be connected to every other Mac!! All the more reason for Apple to invest in something like QSFP.