I am not a mathematician, but I do enough weird stuff that I encounter things referring to Hessians, yet I don't really know what they are, because everyone who writes about them does so in terms that assume the reader already knows what they are.
Any hints? The Battenburg graphics of matrices?
GRADIENT
In the context of optimizing the parameters of a model, the Gradient consists of all the derivatives of the output being optimized (i.e. the total error measure) with respect to each of the model's parameters.
This gives a simplified version of the model, linearized around its current parameter values, making it easy to see which direction a small step should take to move the final output in the desired direction.
It also makes it easy to see which parameters affect that output more vs. less.
[EDIT] Nx1 1st derivative vector, N = #parameters, 1 = scalar output.
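To make the shape concrete, here is a minimal sketch (mine, not from any particular library) of a finite-difference gradient for a made-up two-parameter least-squares model; the data, model, and step size are all placeholder assumptions:

```python
import numpy as np

# Hypothetical toy model: y_hat = w0 + w1 * x, with a scalar squared-error loss.
def loss(params, x, y):
    w0, w1 = params
    return np.mean((w0 + w1 * x - y) ** 2)

def gradient(params, x, y, eps=1e-6):
    """Finite-difference gradient: one derivative of the scalar loss per
    parameter, giving an N x 1 vector (N = #parameters)."""
    g = np.zeros_like(params)
    for i in range(len(params)):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[i] += eps
        p_minus[i] -= eps
        g[i] = (loss(p_plus, x, y) - loss(p_minus, x, y)) / (2 * eps)
    return g

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
params = np.array([0.0, 0.0])

g = gradient(params, x, y)   # shape (2,): the N x 1 gradient
params -= 0.1 * g            # one small gradient-descent step
```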
HESSIAN
The Hessian consists of all 2nd order derivatives, i.e. not just the slope, but the curvature of the error surface around the current parameter values.
Calculating all the first and second derivatives takes more computation and memory, but it gives more information about which direction to take a learning step: not only do we know how the output will respond linearly to a small parameter change, but also whether larger changes will produce higher or lower than linear responses.
This can allow much larger parameter updates per training step, with correspondingly larger improvements in the output, speeding up training considerably.
But the trade-off is that each learning step requires more derivative calculations and memory, so a conducive model architecture, and clever tricks, are often needed to make the Hessian worth using on larger models.
[EDIT] NxNx1 = NxN 2nd derivative matrix, N = #parameters, 1 = scalar output.
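Again a rough sketch under the same toy assumptions as above: a finite-difference Hessian plus one Newton step, showing how the N x N curvature matrix lets you take a larger, better-scaled update than a plain gradient step:

```python
import numpy as np

# Same hypothetical toy loss as the gradient sketch:
# scalar squared error for y_hat = w0 + w1 * x.
def loss(params, x, y):
    w0, w1 = params
    return np.mean((w0 + w1 * x - y) ** 2)

def gradient(params, x, y, eps=1e-6):
    g = np.zeros_like(params)
    for i in range(len(params)):
        p_plus, p_minus = params.copy(), params.copy()
        p_plus[i] += eps
        p_minus[i] -= eps
        g[i] = (loss(p_plus, x, y) - loss(p_minus, x, y)) / (2 * eps)
    return g

def hessian(params, x, y, eps=1e-4):
    """Finite-difference Hessian: all 2nd derivatives of the scalar loss,
    an N x N matrix of curvatures (N = #parameters)."""
    n = len(params)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp, pm, mp, mm = (params.copy() for _ in range(4))
            pp[i] += eps; pp[j] += eps
            pm[i] += eps; pm[j] -= eps
            mp[i] -= eps; mp[j] += eps
            mm[i] -= eps; mm[j] -= eps
            H[i, j] = (loss(pp, x, y) - loss(pm, x, y)
                       - loss(mp, x, y) + loss(mm, x, y)) / (4 * eps ** 2)
    return H

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
params = np.array([0.0, 0.0])

# Newton step: solve H * step = g, using curvature to take one large,
# well-scaled update instead of many small gradient-descent steps.
step = np.linalg.solve(hessian(params, x, y), gradient(params, x, y))
params = params - step   # for this quadratic loss, one step lands near the optimum
```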
JACOBIAN
Another derivative type is the Jacobian, which is the derivative of every individual output (i.e. all those numbers we normally think of as the outputs, not just the final error measure), with respect to every parameter.
Jacobians can become enormous matrices. For billions of parameters, on billions of examples, with hundreds of output elements, we would get a billions x hundreds-of-billions derivative matrix. So the Jacobian calculation can take enormous amounts of extra computation and memory. But there are still occasions (much fewer) when using it can radically speed up training.
[EDIT] NxQxM 1st derivative matrix, N = #parameters, Q = #samples, M = #output elements
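A minimal sketch of the idea with a made-up toy model (my assumptions, not anyone's production code): the Jacobian here holds the derivative of every residual (one per sample, M = 1 output element each) with respect to every parameter, so it is Q x N rather than the N x 1 gradient:

```python
import numpy as np

# Hypothetical toy model with a vector of outputs: one residual per sample.
def residuals(params, x, y):
    w0, w1 = params
    return (w0 + w1 * x) - y          # Q outputs, one per sample

def jacobian(params, x, y, eps=1e-6):
    """Finite-difference Jacobian: shape (Q, N), Q = #samples (times M
    output elements, M = 1 in this toy), N = #parameters."""
    r0 = residuals(params, x, y)
    J = np.zeros((len(r0), len(params)))
    for i in range(len(params)):
        p = params.copy()
        p[i] += eps
        J[:, i] = (residuals(p, x, y) - r0) / eps
    return J

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
J = jacobian(np.array([0.0, 0.0]), x, y)   # shape (4, 2): Q x N
```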
At this point, we have enough computing power and memory available that, in my view, all small enough problems should be trained with Jacobians. Levenberg-Marquardt is an optimization algorithm that uses Jacobians, and it can be orders of magnitude faster than gradient descent.
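For example, SciPy's least_squares with method='lm' wraps a Levenberg-Marquardt solver; the little curve-fitting problem below is purely illustrative:

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical small curve-fitting problem: fit y = a * exp(b * x).
# Levenberg-Marquardt builds the Jacobian of all residuals and uses it to
# take large, well-scaled steps instead of many small gradient steps.
def residuals(params, x, y):
    a, b = params
    return a * np.exp(b * x) - y

x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(1.3 * x)

fit = least_squares(residuals, x0=[1.0, 1.0], args=(x, y), method='lm')
print(fit.x)   # converges to roughly [2.0, 1.3] in a handful of iterations
```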
This helped me, coming from an ml background: https://randomrealizations.com/posts/xgboost-explained/