I think they are training in an affine space, but I see what you're saying. The initialization of the bias must be breaking the symmetry in a way that makes the origin special. Of course, to some degree that's unavoidable, since we have to initialize from distributions with compact support.
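
To make the symmetry point concrete, here is a minimal sketch, assuming a single affine layer f(x) = Wx + b (my assumption; the model under discussion may look different). The names `affine` and `shift_bias` and all the numbers are made up for illustration. Moving the input origin by c can be absorbed exactly by the bias b' = b - Wc, so the family of affine maps has no preferred origin. A zero-initialized bias is not preserved by that change of coordinates, though, which is one way the initialization could single out the origin.

```python
# Hypothetical affine layer f(x) = W x + b in plain Python (no external libraries).

def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def shift_bias(W, b, c):
    """Bias that compensates for moving the input origin by c: b' = b - W c."""
    Wc = affine(W, [0.0] * len(b), c)
    return [b_i - wc_i for b_i, wc_i in zip(b, Wc)]

W = [[2.0, -1.0], [0.5, 3.0]]   # arbitrary illustrative weights
b = [0.0, 0.0]                  # zero-initialized bias (assumed init scheme)
c = [1.0, -2.0]                 # arbitrary translation of the input coordinates
x = [0.25, 4.0]                 # arbitrary input point

b_shifted = shift_bias(W, b, c)

# Same function, expressed in the original and in the translated coordinates.
y_original = affine(W, b, x)
y_translated = affine(W, b_shifted, [x_j + c_j for x_j, c_j in zip(x, c)])

print(y_original, y_translated)  # the two outputs agree
print(b, b_shifted)              # but the zero bias did not stay zero
```

The outputs agree, so the function class is translation-invariant; only the choice "bias starts at zero" depends on where the origin sits.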