I dunno... gradient descent is only really reliable with a big bag of tricks. Good initializations are a starting point, but residual connections and batch/layer normalization go a very long way toward making it dependable.
I agree, this is the correct way to see it IMO. Instead of designing better optimizers, we designed easier parameterizations to optimize. The surprising part is that these parameterizations exist in the first place.
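The "easier parameterizations" point can be illustrated with a toy numerical sketch (my own illustration, not from the thread): pushing a signal through a deep stack of poorly scaled layers makes it vanish, while the same layers wrapped in residual connections plus normalization keep it alive. The layer sizes, depth, and init scale here are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

def layer_norm(x):
    # Normalize activations to zero mean, unit variance (a simple layer norm).
    return (x - x.mean()) / (x.std() + 1e-5)

x_plain = rng.standard_normal(width)
x_resid = x_plain.copy()

for _ in range(depth):
    # Deliberately badly scaled initialization.
    W = rng.standard_normal((width, width)) * 0.05
    # Plain stack: the signal shrinks multiplicatively at every layer.
    x_plain = np.tanh(W @ x_plain)
    # Residual + normalization: the identity path preserves the signal,
    # so the same bad init no longer kills it.
    x_resid = x_resid + np.tanh(W @ layer_norm(x_resid))

print("plain stack norm:   ", np.linalg.norm(x_plain))  # collapses toward 0
print("residual stack norm:", np.linalg.norm(x_resid))  # stays O(1) or larger
```

The forward signal is a stand-in for the gradient: if activations vanish after 50 layers, the backward pass is just as dead, and no optimizer tweak recovers it. Reparameterizing the network fixes what tuning the optimizer can't.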