I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
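As a sanity check, here's a minimal Python sketch of the same arithmetic (time ≈ tokens / prefill throughput). The tokens/sec ranges are the assumed order-of-magnitude figures above, not measurements:

```python
# Rough prefill-time estimate: time ~= prompt_tokens / prefill_throughput.
# Throughput ranges are assumed order-of-magnitude guesses, not benchmarks.
PROMPT_TOKENS = 8_192

setups = {
    "16x H100":               (20_000, 80_000),  # assumed tokens/sec range
    "2x Mac Studio (M3 Max)": (150, 700),        # assumed tokens/sec range
}

for name, (lo, hi) in setups.items():
    fastest = PROMPT_TOKENS / hi  # best case: highest throughput
    slowest = PROMPT_TOKENS / lo  # worst case: lowest throughput
    print(f"{name}: {fastest:.2f}s to {slowest:.2f}s")
```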
These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
Prompt processing/prefill could most likely even get some speedup from the local NPU: when you're ultimately limited by thermal or power throttling, having more power-efficient compute available buys extra headroom.
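To make the headroom argument concrete, here's a toy Python model under a shared power cap. Every number in it (the cap, the watt split, the tokens-per-joule efficiencies) is made up purely for illustration; the point is only that shifting work to more efficient silicon raises sustained throughput when power is the binding constraint:

```python
# Toy model, not a benchmark: under a shared power cap, offloading part of
# prefill to a more power-efficient NPU raises combined sustained throughput.
POWER_CAP_W = 60.0  # hypothetical total power budget during throttling

# Hypothetical efficiency figures (tokens per joule = tokens/sec per watt).
GPU_TOK_PER_JOULE = 50.0
NPU_TOK_PER_JOULE = 120.0  # assumed more efficient, but slower at full tilt

def combined_throughput(gpu_watts: float, npu_watts: float) -> float:
    """Tokens/sec when each unit runs at the given sustained power draw."""
    return gpu_watts * GPU_TOK_PER_JOULE + npu_watts * NPU_TOK_PER_JOULE

print("GPU only:   ", combined_throughput(POWER_CAP_W, 0.0), "tok/s")
print("GPU + NPU:  ", combined_throughput(POWER_CAP_W - 15.0, 15.0), "tok/s")
```

Under these made-up numbers, giving 15 W of the budget to the NPU lifts sustained throughput from 3,000 to 4,050 tok/s, which is the "more headroom" effect in miniature.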