I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192-token input:

• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
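As a sanity check, here's a minimal Python sketch of the same arithmetic (time ≈ tokens / prefill throughput). The tokens/sec ranges are the assumed order-of-magnitude figures above, not measurements:

```python
# Rough prefill-time estimate: time ~= prompt_tokens / prefill_throughput.
# Throughput ranges are assumed order-of-magnitude guesses, not benchmarks.
PROMPT_TOKENS = 8_192

setups = {
    "16x H100":               (20_000, 80_000),  # assumed tokens/sec range
    "2x Mac Studio (M3 Max)": (150, 700),        # assumed tokens/sec range
}

for name, (lo, hi) in setups.items():
    fastest = PROMPT_TOKENS / hi  # best case: highest throughput
    slowest = PROMPT_TOKENS / lo  # worst case: lowest throughput
    print(f"{name}: {fastest:.2f}s to {slowest:.2f}s")
```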
These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
Prompt processing/prefill could most likely even get some speedup from the local NPU: when you're ultimately limited by thermal or power throttling, having more power-efficient compute available buys extra headroom.
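To make the headroom argument concrete, here's a toy Python model under a shared power cap. Every number in it (the cap, the watt split, the tokens-per-joule efficiencies) is made up purely for illustration; the point is only that shifting work to more efficient silicon raises sustained throughput when power is the binding constraint:

```python
# Toy model, not a benchmark: under a shared power cap, offloading part of
# prefill to a more power-efficient NPU raises combined sustained throughput.
POWER_CAP_W = 60.0  # hypothetical total power budget during throttling

# Hypothetical efficiency figures (tokens per joule = tokens/sec per watt).
GPU_TOK_PER_JOULE = 50.0
NPU_TOK_PER_JOULE = 120.0  # assumed more efficient, but slower at full tilt

def combined_throughput(gpu_watts: float, npu_watts: float) -> float:
    """Tokens/sec when each unit runs at the given sustained power draw."""
    return gpu_watts * GPU_TOK_PER_JOULE + npu_watts * NPU_TOK_PER_JOULE

print("GPU only:   ", combined_throughput(POWER_CAP_W, 0.0), "tok/s")
print("GPU + NPU:  ", combined_throughput(POWER_CAP_W - 15.0, 15.0), "tok/s")
```

Under these made-up numbers, giving 15 W of the budget to the NPU lifts sustained throughput from 3,000 to 4,050 tok/s, which is the "more headroom" effect in miniature.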