It turns out the inverse of the Hessian of a deep net is easy to apply to a vector. Done naively, this takes a number of operations cubic in the number of layers (impractical), but it can be done in time linear in the number of layers (very practical)!
This is possible because the Hessian of a deep net has a matrix polynomial structure that factorizes nicely. The Hessian-inverse-product algorithm that takes advantage of this is similar to running backprop on a dual version of the deep net. It echoes an old idea of Pearlmutter's for computing Hessian-vector products.
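For anyone who wants to poke at the Pearlmutter trick itself, here's a minimal JAX sketch (my own toy, not the post's code): the Hessian-vector product is the directional derivative of the gradient, so it costs about two backprops and never materializes H.

```python
import jax
import jax.numpy as jnp

def hvp(loss_fn, params, v):
    # Pearlmutter's trick: H @ v is the directional derivative of the
    # gradient along v, so jvp-of-grad gives it without forming H.
    return jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]

# Toy quadratic, where the Hessian is exactly 2 * A.T @ A.
A = jnp.array([[2.0, 0.0], [1.0, 3.0]])
loss = lambda x: jnp.sum((A @ x) ** 2)
x = jnp.array([1.0, -1.0])
v = jnp.array([0.5, 2.0])
print(hvp(loss, x, v))      # should match the line below
print(2.0 * A.T @ A @ v)
```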
Maybe this idea is useful as a preconditioner for stochastic gradient descent?
I am not a mathematician, but I do enough weird stuff that I keep running into references to Hessians. I still don't really know what they are, because everyone who writes about them assumes the reader already does.
Any hints? The Battenburg graphics of matrices?
Great work. Making the Hessian calculation linear in depth is a solid intermediate step. Thanks for sharing this; I look forward to seeing the final results as this research matures.
I haven't looked into it in years, but would the inverse of a block bidiagonal matrix have some semiseparable structure? Maybe that would be worth looking into?
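For the scalar case at least, that seems right: the strictly lower-triangular part of the inverse of a lower bidiagonal matrix factors as an outer product, which is the semiseparable pattern. A quick numerical sketch in JAX (my own toy check, nothing from the post):

```python
import jax
import jax.numpy as jnp

n = 8
d = jax.random.uniform(jax.random.PRNGKey(0), (n,), minval=1.0, maxval=2.0)
e = jax.random.normal(jax.random.PRNGKey(1), (n - 1,))

B = jnp.diag(d) + jnp.diag(e, k=-1)   # lower bidiagonal
Binv = jnp.linalg.inv(B)

# Semiseparable pattern: any block of B^{-1} taken strictly below the
# diagonal should be (numerically) rank 1, since its entries factor
# as an outer product u_i * w_j.
block = Binv[4:, :4]
print(jnp.linalg.svd(block, compute_uv=False))  # one big singular value, rest ~0
```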
>If the Hessian-vector product is Hv for some fixed vector v, we're interested in solving Hx=v for x. The hope is to soon use this as a preconditioner to speed up stochastic gradient descent.
Silly question, but if you have some clever way to compute the inverse Hessian, why not go all the way and use it for Newton's method, rather than as a preconditioner for SGD?
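For concreteness, here's roughly what I mean, sketched in JAX (loss_fn and params are placeholder names; params assumed to be a single array): a damped Newton step where Hx=g is solved with conjugate gradients on Hessian-vector products, so H is never formed.

```python
import jax
from jax.scipy.sparse.linalg import cg

def hvp(loss_fn, params, v):
    # Pearlmutter Hessian-vector product: H @ v via jvp-of-grad.
    return jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]

def newton_step_cg(loss_fn, params, damping=1e-4, iters=50):
    # Approximate Newton step: solve (H + damping * I) x = g with CG,
    # touching H only through Hessian-vector products.
    g = jax.grad(loss_fn)(params)
    matvec = lambda v: hvp(loss_fn, params, v) + damping * v
    x, _ = cg(matvec, g, maxiter=iters)
    return params - x   # params <- params - (H + damping*I)^{-1} g
```

(The damping is there because a deep net's Hessian is generally indefinite, which I'd guess is part of why one would stop at preconditioning rather than take raw Newton steps.)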