how do you get 5x faster at inference when inference is memory-bandwidth limited? getting 5x the memory bandwidth of an H100 seems physically difficult.
inference is only memory-bandwidth limited when you're targeting high single-stream tps, i.e. small batches. the weights only need to be read from memory once per forward pass, so when you batch, say, 100 streams per forward pass (which is what most inference services do / care about), that read is amortized across the whole batch and it's compute bottlenecked.
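A rough back-of-envelope sketch of that argument, in case it helps. Everything here is an assumption for illustration: the function names are made up, the hardware numbers are round placeholders rather than any specific GPU's spec, and the model ignores KV-cache reads, attention cost, and real-world utilization. It only compares "time to stream the weights once" against "time to do roughly 2 FLOPs per parameter per token" for a given batch size.

```python
def step_times(params, bytes_per_param, batch, mem_bw, peak_flops):
    """Return (memory_time, compute_time) in seconds for one decode step.

    Simplified model: weights are read from memory once per forward pass,
    and each token in the batch costs about 2 FLOPs per parameter.
    """
    memory_time = params * bytes_per_param / mem_bw
    compute_time = 2 * params * batch / peak_flops
    return memory_time, compute_time


def bottleneck(params, bytes_per_param, batch, mem_bw, peak_flops):
    """Name whichever of the two times dominates the step."""
    memory_time, compute_time = step_times(
        params, bytes_per_param, batch, mem_bw, peak_flops
    )
    return "memory" if memory_time >= compute_time else "compute"


def crossover_batch(bytes_per_param, mem_bw, peak_flops):
    """Batch size at which memory time equals compute time.

    Derived by setting the two expressions in step_times equal; the
    parameter count cancels out.
    """
    return peak_flops * bytes_per_param / (2 * mem_bw)


if __name__ == "__main__":
    # Placeholder numbers, NOT a real spec sheet.
    PARAMS = 70e9          # hypothetical 70B-parameter dense model
    BYTES_PER_PARAM = 2    # 16-bit weights
    MEM_BW = 3e12          # 3 TB/s of memory bandwidth
    PEAK_FLOPS = 1e15      # 1 PFLOP/s of dense compute

    for batch in (1, 32, 1024):
        kind = bottleneck(PARAMS, BYTES_PER_PARAM, batch, MEM_BW, PEAK_FLOPS)
        print(f"batch={batch}: {kind}-bound")

    print("crossover batch:", crossover_batch(BYTES_PER_PARAM, MEM_BW, PEAK_FLOPS))
```

The exact crossover moves around a lot with precision, achieved (not peak) FLOPs, and context length, so treat the number it prints as an order-of-magnitude guide. The point is just that a single stream sits far on the memory-bound side, and batching pushes you toward the compute-bound side.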
Rubin has 22TB/s of memory bandwidth vs Blackwell's 8TB/s. NVLink 6 doubles interconnect speed. Plus they're moving to 3nm from ~4nm.
(Previously this comment said Rubin did native NVFP4, but Blackwell does too! Rubin just also trains with native NVFP4, which Blackwell does not.)