Hacker News

Backpropagation is a leaky abstraction (2016)

280 points by swatson741 | yesterday at 5:20 AM | 119 comments

Comments

nirinor yesterday at 2:40 PM

It's a nitpick, but backpropagation is getting a bad rep here. These examples are about gradients + gradient-descent variants being a leaky abstraction for optimization [1].

Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backwards pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)

[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with by tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization (+ modeling) is the actually hard part, not the way gradients are calculated.
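
A quick numerical check of the point above about stacked sigmoids (an illustrative sketch, not from the comment): since σ'(z) = σ(z)(1 − σ(z)) ≤ 0.25, the chained local derivatives shrink geometrically no matter which algorithm computes them.

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  # Chain ten sigmoid "layers" and accumulate d(output)/d(input) via the
  # chain rule. Each factor sigma'(z) * w is at most 0.25 * |w|, so with
  # unit weights the product decays roughly like 0.25**n.
  x, w = 0.5, 1.0
  a, grad = x, 1.0
  for _ in range(10):
      z = w * a
      s = sigmoid(z)
      grad *= s * (1.0 - s) * w  # local derivative of this layer
      a = s

  print(grad)  # tiny: the exponent creeping in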

gchadwick yesterday at 7:20 AM

Karpathy's contribution to teaching around deep learning is just immense. He's got a mountain of fantastic material from short articles like this, longer writing like https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (on recurrent neural networks) and all of the stuff on YouTube.

Plus his GitHub. The recently released nanochat https://github.com/karpathy/nanochat is fantastic. Having minimal, understandable and complete examples like that is invaluable for anyone who really wants to understand this stuff.

joaquincabezas yesterday at 8:03 AM

I took a course in my Master's (URV.cat) where we had to do exactly this: implementing backpropagation (forward and backward passes) from a paper explaining it, using just basic math operations in a language of our choice.

I told everyone this was the best single exercise of the whole year for me. It aligns with the kind of activity that I benefit from immensely but won't do by myself, so this push was just perfect.

If you are teaching, please consider this kind of assignment.

P.S. Just checked now and it's still in the syllabus :)

drivebyhooting yesterday at 7:21 AM

I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly. It gets normalized per weight, fudged with momentum, clipped, etc.

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
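
For reference, here is a rough sketch of what an Adam-style optimizer does to the raw gradient before a step is taken (illustrative only, following the usual Adam formulas; the function name adam_step is made up for this example):

  import numpy as np

  def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
      m = b1 * m + (1 - b1) * g        # momentum: running mean of gradients
      v = b2 * v + (1 - b2) * g * g    # running mean of squared gradients
      m_hat = m / (1 - b1 ** t)        # bias corrections (t starts at 1)
      v_hat = v / (1 - b2 ** t)
      w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight rescaled step
      return w, m, v

The applied step differs from the raw gradient in per-weight scale and (because of momentum) somewhat in direction, but the exact gradient from backprop is still the input to all of it.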

t-vi yesterday at 5:23 PM

It seems to me that in 2016 people did (have to) play a lot more tricks with backpropagation than today. Back then it was common to meddle with gradients in the middle of the backward pass.

For example, Alex Graves's (great! with attention) 2013 paper "Generating Sequences With Recurrent Neural Networks" has this line:

One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.

with this footnote:

In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
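
For anyone curious what that kind of mid-backprop meddling looks like in a modern framework, here is an illustrative PyTorch sketch (my addition, not from the comment or the paper; the clipping range and the loss are arbitrary) that clips the gradient flowing into a tensor during the backward pass via a hook:

  import torch

  z = torch.randn(4, 8, requires_grad=True)  # stand-in for an LSTM layer input
  z.register_hook(lambda grad: grad.clamp(-1.0, 1.0))  # runs during backward

  h = torch.tanh(z)
  loss = 10.0 * (h ** 2).sum()
  loss.backward()
  print(z.grad.abs().max())  # stays <= 1.0; without the hook it would exceed it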

That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.

mirawelner yesterday at 11:04 PM

I feel like my learning curve for AI is:

1) Learn backprop, etc, basic math

2) Learn more advanced things, CNNs, LMM, NMF, PCA, etc

3) Publish a paper or poster

4) Forget basics

5) Relearn that backprop is a thing

repeat.

Some day I need to get my education together.

stared yesterday at 9:38 AM

The original title is "Yes you should understand backprop" - which is good and descriptive.

sebastianconcpt yesterday at 1:57 PM

This comment:

> “Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”

worries me because it is structured with the same reasoning as "why do we have to demonstrate we understand addition if in the real world we have calculators"

jamesblonde yesterday at 8:20 AM

I have to be contrarian here. The students were right. You didn't need to learn to implement backprop in NumPy. Any leakiness in BackProp is addressed by researchers who introduce new optimizers. As a developer, you just pick the best one and find good hparams for it.

WithinReason yesterday at 10:20 AM

Karpathy suggests the following clipped error function:

  def clipped_error(x):
      return tf.select(tf.abs(x) < 1.0,
                       0.5 * tf.square(x),
                       tf.abs(x) - 0.5)  # condition, true, false

Following the same principles that he outlines in this post, the "- 0.5" part is unnecessary: the gradient of the constant 0.5 is 0, so subtracting it doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x²+1).
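
A quick numerical check of that claim (an illustrative sketch, not from the comment): the derivative of √(x²+1) is x/√(x²+1), which stays in (−1, 1), matching the saturating gradient of the clipped error.

  import numpy as np

  def clipped_error_grad(x):
      # Huber-style gradient: x inside [-1, 1], then saturates at sign(x)
      return np.where(np.abs(x) < 1.0, x, np.sign(x))

  def smooth_error_grad(x):
      # d/dx sqrt(x**2 + 1) = x / sqrt(x**2 + 1), always inside (-1, 1)
      return x / np.sqrt(x * x + 1.0)

  xs = np.linspace(-5.0, 5.0, 11)
  print(clipped_error_grad(xs))
  print(smooth_error_grad(xs))  # same saturating behaviour, but smooth
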
alyxya yesterday at 7:39 AM

More generally, it's often worth learning and understanding things one step deeper. A more fundamental understanding explains more of the "why" behind the way things are, or why we do some things a certain way. There's probably a cutoff point for how much you actually need to know, though. You could take things a step further by writing the backward pass without using matrix multiplication, or by spending some time understanding what the numerical value of a gradient means.
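
As an illustration of that last suggestion (a toy sketch, not from the comment): a forward and backward pass for a single neuron written with plain arithmetic, plus a reading of what the resulting gradient value means.

  x, w, b, target = 2.0, 0.3, 0.1, 1.0

  # forward pass
  z = w * x + b              # pre-activation
  y = max(0.0, z)            # ReLU
  loss = (y - target) ** 2

  # backward pass, chain rule term by term
  dloss_dy = 2.0 * (y - target)
  dy_dz = 1.0 if z > 0 else 0.0
  dz_dw = x
  dloss_dw = dloss_dy * dy_dz * dz_dw

  # dloss_dw says how fast the loss changes per unit change in w:
  # nudging w by a small h changes the loss by roughly dloss_dw * h.
  print(dloss_dw)  # -1.2 here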

raindear yesterday at 8:42 PM

Are dead ReLUs still a problem today? If not, why not?

emil-lp yesterday at 7:35 AM

... (2016)

9 years ago, 365 points, 101 comments

https://news.ycombinator.com/item?id=13215590

away74etcie yesterday at 12:28 PM

Karpathy's work on large datasets for deep neural flow is conceiving of the "backward pass" as the preparation for initializing the mechanics for weight ranges, either as derivatives in -10/+10 statistic deviations.

xpe yesterday at 3:42 PM

Karpathy is butchering the metaphor. There is no abstraction here. Backprop is an algorithm. Automatic differentiation is a technique. Neither promises to hide anything.

I agree that understanding them is useful, but they are not abstractions, much less leaky abstractions.

joaquincabezas yesterday at 11:49 AM

Off-topic, but does anybody know what's going on with EurekaLabs? It's been a while since the announcement.

joshdavham yesterday at 7:14 AM

Given that we're now in the year 2025 and AI has become ubiquitous, I'd be curious to estimate what percentage of developers now actually understand backprop.

It's a bit snarky of me, but whenever I see some web developer or product person with a strong opinion about AI and its future, I like to ask "but can you at least tell me how gradient descent works?"

I'd like to see a future where more developers have a basic understanding of ML even if they never go on to do much of it. I think we would all benefit from being a bit more ML-literate.
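
For reference, the gradient-descent question above has a very short answer in code; this is a toy sketch on a quadratic loss (my addition, not from the comment), not a full training loop:

  import numpy as np

  w = np.array([5.0, -3.0])   # parameters
  lr = 0.1                    # learning rate
  for _ in range(100):
      grad = 2 * w            # gradient of loss(w) = ||w||^2
      w = w - lr * grad       # step against the gradient
  print(w)                    # approaches the minimum at [0, 0]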

brcmthrowaway yesterday at 8:45 AM

Do LLMs still use backprop?

phplovesong yesterday at 7:41 AM

Sidenote: why are people still using Medium?

littlestymaar yesterday at 8:17 AM

I was happy to see Karpathy writing a new blog post instead of just Twitter threads, but when I opened the link I was disappointed to realize it's from 9 years ago…

I really hate what Twitter did to blogging…
