Two thoughts:
> how important is computing the exact gradient using calculus
Normally the gradient is computed on a small "minibatch" of examples, so each individual step never moves exactly along the true gradient, but on average over many steps the true gradient is followed. This noisy walk is actually quite beneficial for the final performance of the network (https://arxiv.org/abs/2006.15081 , https://arxiv.org/abs/1609.04836), so much so that people started wondering about the best way to "corrupt" this approximate gradient even further to improve performance (https://arxiv.org/abs/2202.02831 , and many other works on SGD noise).
> vs just knowing the general direction to step
I can't find the relevant papers now, but I seem to recall that the eigenvalues of the loss Hessian decay rather quickly, which means that taking a step in most directions will barely change the loss. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have been trying to visualize the loss landscape and the trajectory taken during optimization (https://arxiv.org/pdf/1712.09913 , https://losslandscape.com/).
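Here's a toy numpy sketch of that intuition. Everything in it is my own illustrative assumption (a quadratic loss whose Hessian eigenvalues decay like 1/i^2), not taken from the papers above, but it shows how the curvature along a random direction is tiny compared to the curvature along the sharpest eigendirection:

```python
import numpy as np

# Toy quadratic loss 0.5 * x^T H x with a fast-decaying Hessian spectrum.
# Assumption for illustration: eigenvalues lambda_i = 1 / i^2, H diagonal.
rng = np.random.default_rng(0)
d = 10_000
eigvals = 1.0 / np.arange(1, d + 1) ** 2

def curvature(direction):
    """Second derivative of the loss along a unit direction: d^T H d."""
    u = direction / np.linalg.norm(direction)
    return np.sum(eigvals * u ** 2)

top_dir = np.zeros(d); top_dir[0] = 1.0      # sharpest eigendirection
random_dir = rng.normal(size=d)              # a typical random direction

print("curvature along top eigendirection:", curvature(top_dir))     # ~1.0
print("curvature along a random direction:", curvature(random_dir))  # ~tr(H)/d, tiny
```

A random direction spreads its weight over all the flat directions, so its curvature is roughly tr(H)/d, orders of magnitude smaller than along the few sharp directions.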
The first bit is why it is called stochastic gradient descent: at each step you follow the gradient of a randomly chosen minibatch, which basically makes you "vibrate" down along the true gradient direction.
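A minimal sketch of that noisy walk, on a toy least-squares problem of my own choosing (the sizes, learning rate, and step count are just illustrative):

```python
import numpy as np

# Minibatch SGD on a toy least-squares problem.
rng = np.random.default_rng(0)
n, dim, batch_size, lr = 1024, 20, 32, 0.05

X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = X @ w_true + 0.1 * rng.normal(size=n)        # noisy labels

w = np.zeros(dim)
for step in range(2000):
    # Gradient of a random minibatch: an unbiased but noisy estimate
    # of the full-batch gradient.
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size     # grad of 0.5*mean((Xb w - yb)^2)
    w -= lr * grad                               # noisy step; correct only on average

print("distance to w_true:", np.linalg.norm(w - w_true))
```

Each step wanders a bit off the true gradient, but the expectation over minibatches is the full-batch gradient, so the iterates still "vibrate" their way down to (a neighborhood of) the solution.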