> We quantize the pseudo-gradients to int8, reducing communication requirements by 400x.
Can someone explain whether this actually reduces overall model quality?
> In our experiments, we found that we are able to perform int8 quantization on the pseudo gradients without any impact on the loss curves.
Allegedly not?
The gradients are noisy as they are; this additional quantization noise probably doesn't hurt much overall.
To give some intuition here: it's not crazy to think that combining a bunch of different 8-bit values that were going to be averaged anyway gets you back to something like 32 bits of effective precision. Especially since it's not always (or even often) the case that a particular weight actually needs the edges of that mantissa.
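To make that intuition concrete, here's a toy numpy sketch (my own, not the authors' code; `quantize_int8` and the noise model are assumptions) of symmetric per-tensor int8 quantization. It measures how much extra error the quantization adds once pseudo-gradients from many workers are averaged, compared to quantizing a single copy:

```python
# Toy demo: averaging many independently quantized int8 pseudo-gradients
# washes out most of the per-copy quantization error.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: returns (int8 codes, scale)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
true_grad = rng.normal(size=100_000).astype(np.float32)

num_workers = 64
float_avg = np.zeros_like(true_grad)   # average of fp32 pseudo-gradients
int8_avg = np.zeros_like(true_grad)    # average of int8-quantized copies
for _ in range(num_workers):
    # each worker's pseudo-gradient = shared signal + its own noise
    g = true_grad + rng.normal(scale=0.1, size=true_grad.shape).astype(np.float32)
    float_avg += g
    q, s = quantize_int8(g)
    int8_avg += dequantize(q, s)
float_avg /= num_workers
int8_avg /= num_workers

q1, s1 = quantize_int8(true_grad)
print("quantization error, single int8 copy  :",
      np.abs(dequantize(q1, s1) - true_grad).mean())
print("extra error after averaging 64 copies :",
      np.abs(int8_avg - float_avg).mean())
```

Since each worker's quantization error is roughly independent, the averaged error shrinks by about the square root of the number of workers, which is consistent with the quantization not showing up in the loss curves.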