Curious about the memory bandwidth constraints here. 20B parameters at 20fps seems like it would saturate the bandwidth of a single GPU unless you are running int4. I assume this requires an H100?
Yep, the model is running on Hopper architecture. Anything less was not sufficient in our experiments.
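For anyone who wants to sanity-check the bandwidth question, here is a rough back-of-envelope sketch. It assumes a dense model whose weights are all read from memory once per forward pass, with no batching or weight reuse. The thread does not say how many forward passes each frame needs, so `passes_per_frame` is a hypothetical knob, and the peak bandwidth numbers are approximate vendor figures rather than measured throughput.

```python
# Rough estimate of the memory bandwidth needed just to stream the weights.
# Assumption: every parameter is read once per forward pass (dense, batch 1).

def required_bandwidth_gbs(params_billion, bytes_per_param, fps, passes_per_frame=1):
    """Return weight-read bandwidth in GB/s."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    return weight_gb * fps * passes_per_frame

# Bytes per parameter for common precisions.
PRECISIONS = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Approximate peak HBM bandwidth in GB/s (assumed round figures).
PEAK_GBS = {"A100 80GB": 2000, "H100 SXM": 3350}

if __name__ == "__main__":
    for precision, nbytes in PRECISIONS.items():
        need = required_bandwidth_gbs(20, nbytes, 20)
        print(f"{precision}: {need:.0f} GB/s for weights alone")
        for gpu, peak in PEAK_GBS.items():
            print(f"  {gpu}: {100 * need / peak:.0f}% of peak")
```

The estimate scales linearly with `passes_per_frame`, so a model that needs several forward passes per frame multiplies these numbers accordingly. It also ignores activations, any cache traffic, and the fact that real kernels rarely reach peak bandwidth, so treat it as a lower bound.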