Hacker News

boroboro4 today at 3:34 AM

What does it mean that it's better at NVFP4 training? What's different between training and inference that would make this true?


Replies

Schiendelman today at 4:17 AM

We're getting to the limit of my understanding, but I believe most Blackwell users still usually run FP8 passes through the Transformer Engine and just store weights in NVFP4. Nvidia has model-specific stabilization recipes for end-to-end NVFP4, but they're still receiving fixes all the time.

Nvidia says Rubin should have fewer stability problems training with FP4 because of hardware changes - "adaptive compression". There will still be outlier instability inherently, but something they're designing in reduces the cost of managing it.
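To make the "outlier instability" point concrete, here is a rough NumPy sketch of why one outlier hurts a block-scaled 4-bit format. It assumes a 16-element block that shares one scale and an E2M1 value grid; the scale is kept in full precision for simplicity, and `quantize_block` and the sample data are hypothetical, not Nvidia's actual recipe or hardware behavior.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(block):
    """Round one block to the E2M1 grid using a single shared scale.

    Assumption: the scale maps the block's largest magnitude to 6.0 and is
    stored in full precision, which is a simplification.
    """
    block = np.asarray(block, dtype=np.float64)
    amax = np.abs(block).max()
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    scaled = block / scale
    # Round-to-nearest onto the grid of representable magnitudes.
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

rng = np.random.default_rng(0)
typical = rng.uniform(0.5, 1.0, size=16)  # hypothetical well-behaved values
with_outlier = typical.copy()
with_outlier[0] = 100.0                   # one hypothetical outlier

# Mean absolute error on the 15 non-outlier elements in each case.
err_typical = np.abs(quantize_block(typical) - typical)[1:].mean()
err_outlier = np.abs(quantize_block(with_outlier) - with_outlier)[1:].mean()
print(err_typical, err_outlier)
```

With the outlier present, the shared scale grows so much that every other element in the block rounds to zero, which is the kind of damage a stabilization recipe has to manage.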

But yeah, grain of salt - we haven't seen this in practice.

fc417fc802 today at 4:06 AM

I'm also puzzled by that statement. The issue with training is (as I understand it) one of precision and the associated numerical stability. You need enough bits in order for backprop to function correctly.
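A toy illustration of the precision point, as a sketch rather than how any real framework does it: if a weight is rounded back to a 4-bit grid after every update, small gradient steps round away and the weight never moves, while a higher-precision copy accumulates them. The E2M1 grid, the fixed scale of 1.0, and the helper names `round_to_fp4` and `train_steps` are assumptions made for the example.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def round_to_fp4(x, scale=1.0):
    """Round-to-nearest onto the E2M1 grid (hypothetical fixed scale)."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float64)) / scale
    idx = np.abs(np.abs(x)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * E2M1_GRID[idx] * scale

def train_steps(w0, grad, lr, steps, requantize_every_step):
    """Apply `steps` plain SGD updates with a constant gradient.

    If requantize_every_step is True the weight lives only in FP4;
    otherwise a float64 "master" copy accumulates the updates and is
    quantized once at the end, as it would be for inference.
    """
    w = np.array([w0], dtype=np.float64)
    for _ in range(steps):
        w = w - lr * grad
        if requantize_every_step:
            w = round_to_fp4(w)
    return float(round_to_fp4(w)[0])

w_fp4_only = train_steps(1.0, grad=1.0, lr=0.01, steps=100, requantize_every_step=True)
w_master = train_steps(1.0, grad=1.0, lr=0.01, steps=100, requantize_every_step=False)
print(w_fp4_only, w_master)
```

Each 0.01 step is far smaller than the 0.5 gap to the next grid value, so the FP4-only weight snaps back to where it started every time. Inference only needs the final rounded value, which is why the same datatype can be fine there.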

Of course, there are techniques such as quantization-aware training, but I don't understand why a datatype would work for inference but not for that.

You can also abandon backprop entirely, but that comes with a whole host of tradeoffs. And again, why would the datatype work for inference but not for whatever alternative training regime you selected?