> But with more advanced optimizers the gradient is not really used directly. It gets per-weight normalization, fudged with momentum, clipped, etc.
Why would these things be "fudging"? Vanishing gradients are a real problem (see the original batch normalization paper), and keeping the relative gradient magnitudes roughly comparable ("smooth") across layers makes for an easier optimization problem.
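To make the "per-weight normalization, momentum, clipping" part concrete, here is a minimal NumPy sketch of an Adam-style update. The function name, clipping threshold, and toy gradient are my own illustration, not anything from the thread; the point is just to show where each of those operations sits in the update.

```python
import numpy as np

def adam_like_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, clip_norm=1.0):
    # 1. Clip the gradient by its global norm.
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    # 2. Momentum: exponential moving average of the gradient (first moment).
    m = beta1 * m + (1 - beta1) * grad
    # 3. Per-weight running average of the squared gradient (second moment).
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # 4. Per-weight normalization: each weight's step is scaled by
    #    1/sqrt(v_hat), so weights with persistently large gradients
    #    take relatively smaller steps.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage: one parameter vector, a fixed made-up "gradient".
w = np.zeros(4)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 6):
    grad = np.array([10.0, 0.1, -3.0, 0.01])   # components differ by 1000x
    w, m, v = adam_like_step(w, grad, m, v, t)
print(w)  # steps end up similar in magnitude despite wildly different gradient components
```

So the raw gradient is indeed transformed quite a bit, but each transformation is there to keep the effective step sizes well-behaved across weights and layers.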
> So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
Very. In high-dimensional spaces, steps that are tiny per coordinate can still move you extremely far from a proper solution. See adversarial examples.
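Here is a toy NumPy sketch of the standard linear-model intuition behind adversarial examples (the score function, epsilon, and dimensions are my own illustration): perturbing every coordinate by only 0.01 in the worst-case direction shifts a linear score by an amount that grows with dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # each coordinate moves by at most 0.01

for d in (10, 1_000, 100_000):
    w = rng.normal(size=d)           # a linear "classifier": score = w @ x
    x = rng.normal(size=d)
    x_adv = x + eps * np.sign(w)     # tiny per-coordinate step in the worst direction
    print(d, abs(w @ x_adv - w @ x))  # change in score = eps * sum(|w|), grows with d
```

The per-coordinate perturbation is the same in every case, yet the change in the score goes from under 0.1 to hundreds as the dimension grows, which is why getting the direction only roughly right is not as harmless as it sounds.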