Hacker News

sdenton4 · yesterday at 11:19 PM

I dunno... gradient descent is only really reliable with a big bag of tricks. Knowing good initializations is a starting point, but residual connections and batch/layer normalization go a very long way toward making it dependable.
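
For concreteness, here's a minimal PyTorch sketch of that combination (a pre-norm residual block with layer normalization). The class name, dimensions, and the two-layer MLP inside are arbitrary choices of mine, not anything from a particular paper:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Pre-norm residual block: out = x + f(norm(x)).
        # At init the block is close to the identity map, which keeps
        # gradients well-scaled through a deep stack.
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Sequential(
                nn.Linear(dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.ff(self.norm(x))

    # A deep stack of these trains stably with plain SGD/Adam,
    # where an equally deep plain MLP often would not.
    model = nn.Sequential(*[ResidualBlock(64, 256) for _ in range(32)])
    y = model(torch.randn(8, 64))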


Replies

hellohello2 · today at 12:08 AM

I agree; this is the correct way to see it, IMO. Instead of designing better optimizers, we designed easier parameterizations to optimize. The surprising part is that such parameterizations exist in the first place.
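
A classic instance of the "easier parameterization" idea is weight normalization (Salimans & Kingma, 2016): rewrite each weight matrix as w = g * v / ||v||, so scale and direction become separate coordinates. The reachable functions are the same as a plain linear layer; only the geometry the optimizer moves through changes. A rough sketch, with the module name and init scale being my own choices:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightNormLinear(nn.Module):
        # Same function class as nn.Linear, but parameterized as
        # w = g * v / ||v|| per output row, so plain SGD adjusts
        # scale (g) and direction (v) independently.
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.v = nn.Parameter(torch.randn(out_dim, in_dim) * 0.05)
            self.g = nn.Parameter(torch.ones(out_dim))
            self.b = nn.Parameter(torch.zeros(out_dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
            return F.linear(x, w, self.b)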