So, is computing gradients a detail that backpropagation fails to abstract over, or are gradients the goal that backpropagation achieves? It isn't both; it's just the latter.
This is like complaining about long division not behaving nicely when dividing by 0. The algorithm isn't the problem, and blaming the wrong part does not help understanding.
It distracts from what actually helps, which is using different functions with nicer-behaved gradients, e.g., the Huber loss instead of the quadratic loss.
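For a concrete toy illustration, here is a minimal sketch, assuming PyTorch and a Huber delta of 1.0 (both just illustrative choices), comparing the gradient of the quadratic (MSE) loss with the gradient of the Huber loss as the residual grows:

```python
import torch

# Compare d(loss)/d(prediction) for the quadratic loss and the Huber loss
# as the residual (prediction - target) gets larger.
target = torch.tensor(0.0)
for residual in [0.5, 5.0, 50.0]:
    pred = torch.tensor(residual, requires_grad=True)

    mse = (pred - target) ** 2                                       # quadratic loss
    huber = torch.nn.functional.huber_loss(pred, target, delta=1.0)  # Huber loss

    (g_mse,) = torch.autograd.grad(mse, pred)
    (g_huber,) = torch.autograd.grad(huber, pred)

    print(f"residual={residual:5.1f}  dMSE/dpred={g_mse.item():6.1f}  dHuber/dpred={g_huber.item():4.1f}")
```

The quadratic loss's gradient grows linearly with the residual, while the Huber gradient saturates at ±delta; that saturation is exactly the "nicer-behaved gradient" I mean.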
It’s just an observation. Backprop is an abstraction in the classical computer-science sense: you stack some modules and the backprop is generated for you. It’s leaky in the sense that you can’t fully abstract away the details, because of the vanishing/exploding gradient issues you must stay mindful of.
It is definitely a useful thing for people who are learning this topic to understand from day 1.
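To make the "stack some modules and the backprop is generated" point concrete, here's a minimal sketch (assuming PyTorch; the layer sizes and the dummy loss are arbitrary):

```python
import torch
import torch.nn as nn

# Forward pass: just stacked modules, no derivatives written by hand.
model = nn.Sequential(
    nn.Linear(10, 16),
    nn.Tanh(),
    nn.Linear(16, 1),
)

x = torch.randn(4, 10)
loss = model(x).pow(2).mean()   # any scalar built from the forward computation
loss.backward()                 # the backward pass is generated automatically

# Every parameter now carries a gradient.
print(model[0].weight.grad.shape)   # torch.Size([16, 10])
```

That's the abstraction; the leak is that you still have to care how those stacked modules behave under differentiation.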
> It distracts from what actually helps, which is using different functions with nicer-behaved gradients, e.g., the Huber loss instead of the quadratic loss.
Fully agree. It's not the "fault" of backprop. It does what you tell it to do: find the direction in which your loss is reduced the most. If the first layers get no signal because the gradient vanishes, the reason is your network layout, not the algorithm. The same layout problem shows up in the exploding case: very small modifications in the initial layers would lead to very large modifications in the final layers (essentially an unstable computation), so gradient descent simply cannot move that fast.
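To illustrate the "unstable computation" point, here's a minimal sketch (assuming PyTorch; the all-linear stack and the 10x weight scaling are artificial choices made purely to make the effect obvious). A tiny change to one first-layer weight gets amplified enormously by the time it reaches the output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(*[nn.Linear(32, 32) for _ in range(10)])
with torch.no_grad():
    for layer in model:
        layer.weight.mul_(10.0)   # badly scaled layout -> unstable forward computation

x = torch.randn(1, 32)
out = model(x)

# Perturb a single first-layer weight by a tiny amount and re-run the forward pass.
with torch.no_grad():
    model[0].weight[0, 0] += 1e-6
out_perturbed = model(x)

change = (out_perturbed - out).norm().item()
print(f"output change from a 1e-6 weight change: {change:.3e} (~{change / 1e-6:.1e}x amplification)")
```

With that kind of amplification, any learning rate large enough to move the early layers meaningfully would blow the output up, so gradient descent has to creep.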
Rather than a failure, the vanishing/exploding gradient is a vital signal for debugging your network. Inspecting things like per-layer gradient magnitudes shows whether you have vanishing or exploding gradients. And dealing with that has led to great inventions, such as residual networks and a whole class of normalization methods (batch normalization among them).
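For instance, a minimal sketch of that kind of per-layer inspection (assuming PyTorch; the 20-layer sigmoid stack is deliberately a layout prone to vanishing gradients):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deliberately deep stack of sigmoid layers.
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

x = torch.randn(64, 32)
y = torch.randn(64, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient norm of each Linear layer's weight, ordered from input side to output side.
for name, param in model.named_parameters():
    if name.endswith("weight"):
        print(f"{name:10s} grad norm = {param.grad.norm().item():.2e}")
```

On a layout like this, the early layers' gradient norms typically come out many orders of magnitude smaller than the last layer's; insert residual connections or normalization layers and the same printout tends to flatten out.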