dang I wish I could share md tables.
Here’s a text edition: For $50k the inference hardware market forces a trade-off between capacity and throughput:
* Apple M3 Ultra Cluster ($50k): Maximizes capacity (3TB). It is the only option in this price class capable of running the largest open-weight models (e.g., the ~1T-parameter Kimi K2), albeit at low speeds (~15 t/s).
* NVIDIA RTX 6000 Workstation ($50k): Maximizes throughput (>80 t/s). It is superior for training and for raw inference speed, but is hard-capped at 384GB of VRAM (4x 96GB cards), restricting model size to roughly <400B parameters.
To achieve both high capacity (3TB) and high throughput (>100 t/s) requires a ~$270,000 NVIDIA GH200 cluster and data center infrastructure. The Apple cluster provides 87% of that capacity for 18% of the cost.
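Back-of-the-envelope math behind those ratios, if anyone wants to check it (the per-node GH200 breakdown is my assumption; only the cluster totals are firm):

```python
# Sanity check on the capacity/cost ratios above. The six-node GH200
# breakdown (480GB LPDDR5X + 96GB HBM3 per node) is an assumption,
# not a quoted configuration.
apple_cost, gh200_cost = 50_000, 270_000
apple_capacity_tb = 3.0

print(f"cost ratio: {apple_cost / gh200_cost:.1%}")  # ~18.5% -> "18% of the cost"
print(f"implied GH200 capacity: {apple_capacity_tb / 0.87:.2f} TB")  # ~3.45 TB

# ~3.45 TB is consistent with six GH200 nodes at 576GB each:
print(f"six-node GH200: {6 * (480 + 96) / 1000:.2f} TB")
```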
For $50k, you could buy 25 Framework Desktop mainboards (128GB of unified memory each with Strix Halo, so over 3TB total). Not sure how you'd cluster all of them, but it might be fun to try. ;)
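Quick math, treating the ~$2k board price as approximate:

```python
# Rough math on the Framework cluster idea; ~$2k/board is an
# approximate street price, not a quote.
board_cost, board_mem_gb = 2_000, 128   # Framework Desktop mainboard, Strix Halo
budget = 50_000

boards = budget // board_cost
print(boards, "boards")                          # 25
print(f"{boards * board_mem_gb / 1024:.2f} TB")  # ~3.13 TB total
```

For the clustering part, something like llama.cpp's RPC backend or exo could in principle shard a model across the boards over Ethernet, though at 25 nodes the network would almost certainly be the bottleneck.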
What's the math on the $50k NVIDIA cluster? My understanding is these cards cost ~$8k each, so you can get at least 5 for $40k; that's around half a TB.
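That is, assuming the 96GB RTX PRO 6000 rather than the 48GB RTX 6000 Ada (the latter wouldn't get anywhere near half a TB):

```python
# My math, assuming ~$8k street price for a 96GB card.
gpu_cost, gpu_vram_gb = 8_000, 96
cards = 5

print(f"${cards * gpu_cost:,}")       # $40,000, leaving ~$10k for the host
print(f"{cards * gpu_vram_gb} GB")    # 480GB -- "around half a TB"
```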
That being said, for inference Macs still remain the best value, and the M5 Ultra will be an even better one with its improved prompt processing (PP).
Are you factoring in the above comment about the as-yet-unimplemented parallel speedup? For on-prem inference without any kind of ASIC, this seems like quite a bargain, relatively speaking.
Apple deploys LPDDR5X for energy efficiency and cost (where lower is better), whereas NVIDIA will always prefer GDDR and HBM for performance and cost (where, for NVIDIA, higher is better).
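Some rough spec-sheet numbers to put that in perspective (approximate, and they vary by SKU):

```python
# Approximate peak memory bandwidth from public spec sheets; treat
# these as ballpark figures rather than exact values.
approx_bandwidth_gb_s = {
    "M3 Ultra (LPDDR5X, unified)": 819,
    "RTX PRO 6000 (GDDR7, per card)": 1_790,
    "GH200 (HBM3e, per superchip)": 4_900,
}
for part, bw in approx_bandwidth_gb_s.items():
    print(f"{part}: ~{bw} GB/s")
```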
what about a GB300 workstation with 784GB unified mem?
15 t/s is way too slow for anything but chatting, call-and-response, and you don't need a trillion-parameter model for that.
Wake me up when the situation improves
You can keep scaling down! I spent $2k on an old dual-socket Xeon workstation with 768GB of RAM; I can run DeepSeek-R1 at ~1-2 tokens/sec.
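A rough bandwidth-bound estimate lands in the same ballpark; the quantization and bandwidth figures below are assumptions, not measurements from that box:

```python
# CPU decode is roughly memory-bandwidth-bound: every token requires
# streaming the active weights from RAM. Figures below are assumptions.
active_params_b = 37        # DeepSeek-R1 is MoE: ~37B active params/token
bytes_per_param = 0.6       # ~Q4 quantization incl. overhead (assumed)
eff_bandwidth_gb_s = 100    # effective dual-socket DDR4 bandwidth (assumed)

gb_per_token = active_params_b * bytes_per_param
print(f"upper bound: {eff_bandwidth_gb_s / gb_per_token:.1f} t/s")
# ~4.5 t/s theoretical ceiling; ~1-2 t/s observed is plausible once
# NUMA traffic and non-ideal expert placement eat into it.
```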