I dunno... gradient descent is only really reliable with a big bag of tricks. Good initializations are a starting point, but residual connections and batch/layer normalization go a very long way toward making it dependable.
I agree, this is the correct way to see it IMO. Instead of designing better optimizers, we designed easier parameterizations to optimize. The surprising part is that these parameterizations exist in the first place.
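The "easier parameterizations" point can be illustrated with a toy numerical sketch (my own illustration, not from the thread): pushing a signal through a deep stack of poorly scaled layers makes it vanish, while the same layers wrapped in residual connections plus normalization keep it alive. The layer sizes, depth, and init scale here are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

def layer_norm(x):
    # Normalize activations to zero mean, unit variance (a simple layer norm).
    return (x - x.mean()) / (x.std() + 1e-5)

x_plain = rng.standard_normal(width)
x_resid = x_plain.copy()

for _ in range(depth):
    # Deliberately badly scaled initialization.
    W = rng.standard_normal((width, width)) * 0.05
    # Plain stack: the signal shrinks multiplicatively at every layer.
    x_plain = np.tanh(W @ x_plain)
    # Residual + normalization: the identity path preserves the signal,
    # so the same bad init no longer kills it.
    x_resid = x_resid + np.tanh(W @ layer_norm(x_resid))

print("plain stack norm:   ", np.linalg.norm(x_plain))  # collapses toward 0
print("residual stack norm:", np.linalg.norm(x_resid))  # stays O(1) or larger
```

The forward signal is a stand-in for the gradient: if activations vanish after 50 layers, the backward pass is just as dead, and no optimizer tweak recovers it. Reparameterizing the network fixes what tuning the optimizer can't.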