Hacker News

dnhkng · last Thursday at 7:22 AM · 1 reply

Fair points, but the deal is still great because of the nuances of the RAM/VRAM.

The Blackwells are superior on paper, but there's some "Nvidia Math" involved: when they report performance in press announcements, they don't usually mention the precision. Yes, the Blackwells are more than double the speed of the Hopper H100s, but that's comparing FP8 to FP4 (the H100s can't do native FP4). That's great for certain workloads, but not the majority.

What's more interesting is the VRAM speed. The 6000 Pro has 96 GB of GPU memory at 1.8 TB/s of bandwidth; the H100 has the same amount, but with HBM3 at 4.9 TB/s. That roughly 2.7x bandwidth increase is very influential in the overall performance of the system.
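
To put a rough number on why bandwidth dominates here: a crude roofline sketch in Python, assuming single-stream decoding has to stream roughly the active weights from VRAM once per generated token. The ~5B active parameters and ~0.5 bytes/param are ballpark assumptions for an MXFP4 MoE model, and the result is only an upper bound that ignores KV-cache reads, attention, and kernel overhead:

    # Bandwidth-only ceiling for batch-1 decoding (assumptions, not measurements):
    # each generated token must stream the active weights from GPU memory once,
    # so tokens/sec is roughly memory bandwidth / bytes read per token.
    def decode_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param):
        bytes_per_token = active_params_b * 1e9 * bytes_per_param
        return bandwidth_gb_s * 1e9 / bytes_per_token

    for name, bw in [("RTX 6000 Pro, 1.8 TB/s", 1800), ("GH200 HBM3, 4.9 TB/s", 4900)]:
        ceiling = decode_tokens_per_sec(bw, 5.0, 0.5)  # ~5B active params (MoE), ~FP4
        print(f"{name}: bandwidth-only ceiling ~{ceiling:.0f} tok/s")

Real numbers come in well below these ceilings, but the point is that the ceiling scales linearly with memory bandwidth, so a bandwidth-bound decode workload tracks that 2.7x gap.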

Lastly, if it works, the NVLink-C2C provides 900 GB/s of bandwidth between the cards, about 5x what a pair of 6000 Pros could do over PCIe 5.0. Big LLMs need well over the 96 GB available on a single card, so this link becomes the bottleneck.
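
A quick back-of-envelope check of the "doesn't fit on one card" point, using hypothetical round numbers (weights only; KV cache and runtime overhead push the real footprint higher):

    # Does a ~120B-parameter model's weights fit in one 96 GB card at a given precision?
    VRAM_GB = 96  # per card, as quoted above

    def weight_footprint_gb(total_params_billion, bytes_per_param):
        # billions of params * bytes/param == gigabytes of weights
        return total_params_billion * bytes_per_param

    for precision, bpp in [("FP16", 2.0), ("FP8", 1.0), ("FP4/MXFP4", 0.5)]:
        gb = weight_footprint_gb(120, bpp)  # ~120B params, e.g. GPT-OSS-120B
        verdict = "fits" if gb <= VRAM_GB else "needs more than one card"
        print(f"{precision}: ~{gb:.0f} GB of weights -> {verdict}")

So a 120B-class model only squeezes onto a single 96 GB card at 4-bit precision; at FP8 or FP16 you're splitting it across cards and the interconnect starts to matter.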

E.g., here are benchmarks on the RTX 6000 Pro running the GPT-OSS-120B model, where it generates 145 tokens/sec; I get 195 tokens/sec on the GH200. https://www.reddit.com/r/LocalLLaMA/comments/1mm7azs/openai_...


Replies

crapple8430 · last Thursday at 1:32 PM

The perf delta is smaller than I thought it'd be given the memory bandwidth difference. I guess it likely comes from the Blackwell having native MXFP4, since GPT-OSS-120B has MXFP4 MoE layers.
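
For reference, plugging in the numbers quoted upthread:

    # Measured speedup vs. raw bandwidth ratio, using the figures from the parent comments.
    measured  = 195 / 145   # GH200 vs RTX 6000 Pro tokens/sec from the benchmark above
    bandwidth = 4.9 / 1.8   # HBM3 vs GDDR7 bandwidth in TB/s
    print(f"measured speedup ~{measured:.2f}x vs. bandwidth ratio ~{bandwidth:.2f}x")

That's roughly a 1.34x measured speedup against a ~2.7x bandwidth gap.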

The NVLink is definitely a strong point; I missed that detail. For LLM inference specifically it matters fairly little, IIRC, but for training it might.