It seems to me that in 2016 people played (or had to play) a lot more tricks with backpropagation than they do today. Back then it was common to meddle with the gradients in the middle of the backward pass.
For example, Alex Graves's great 2013 paper "Generating Sequences with Recurrent Neural Networks" (an early use of attention) has this line:
> One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.
with this footnote:
> In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
That said, backpropagation seems important enough to me that I once made a specialized video course just on PyTorch (1.x) autograd.
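To illustrate the kind of mid-backprop intervention Graves describes, here is a minimal PyTorch sketch using a tensor hook; the toy linear layer and the [-1, 1] clipping range are placeholders of mine, not values from the paper:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for "the network inputs to the LSTM layers": a pre-activation
# tensor whose incoming gradient we clip during the backward pass.
x = torch.randn(4, 8)
pre_activation = nn.Linear(8, 8)(x)

# register_hook runs when the gradient for this tensor is computed and lets us
# replace it; here each element is clamped to [-1, 1] before flowing further back.
pre_activation.register_hook(lambda grad: grad.clamp(-1.0, 1.0))

loss = torch.tanh(pre_activation).sum()
loss.backward()  # everything upstream of the hook now sees the clipped gradient
```

Note that this clips element-wise at an intermediate tensor, which is a different operation from the global-norm clipping of parameter gradients that frameworks ship today.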
> It seems to me that in 2016 people played (or had to play) a lot more tricks with backpropagation than they do today
Perhaps, but maybe that's because there was more experimentation with different neural-net architectures and layer/node types back then?
Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.
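For instance, the usual pattern in PyTorch is a one-liner between backward() and step(); a minimal sketch, where the model, data, and max_norm value are placeholders:

```python
import torch
import torch.nn as nn

# Placeholder model and batch, just to show where clipping goes in the loop.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Framework-provided global-norm clipping of the parameter gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```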
The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.