> It seems to me that in 2016 people did (have to) play a lot more tricks with the backpropagation than today
Perhaps, though that may be because there was more experimentation with different neural net architectures and layer/node counts back then?
Nowadays the training problems are better understood, clipping is supported by the frameworks, and it's easy to find training examples online with clipping enabled.
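For anyone curious what the frameworks are doing under the hood: here's a minimal plain-Python sketch of gradient clipping by global norm (no particular framework's API assumed; real libraries expose this as a one-line call and operate on tensors, not flat lists).

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down so their global L2 norm is at most max_norm.

    Toy version operating on a flat list of floats; frameworks do the
    same thing across all parameter tensors at once.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# An exploding gradient gets rescaled; a small one passes through unchanged.
clip_by_global_norm([30.0, 40.0], max_norm=5.0)  # norm 50 -> [3.0, 4.0]
clip_by_global_norm([0.3, 0.4], max_norm=5.0)    # unchanged
```

The key point is that the whole gradient vector is scaled uniformly, so its direction is preserved; only the step size is capped.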
The problem itself didn't actually go away. ReLU (or GELU) is still the default activation for most networks, and training an LLM is apparently something of a black art. Hugging Face just released their "Smol Training Playbook: a distillation of hard earned knowledge to share exactly what it takes to train SOTA LLMs", so evidently even in 2025 training isn't exactly a turn-key affair.
I wonder if it's just that the important neural nets are now trained by large, secretive corporations that aren't interested in sharing their knowledge.