Hacker News

Softmax forever, or why I like softmax

173 points by jxmorris12 | last Sunday at 7:08 AM | 97 comments

Comments

roger_ yesterday at 1:02 PM

An aside: please use proper capitalization. With this article I found myself backtracking, thinking I'd missed a word, which was very annoying. Not sure what the author's intention was with that decision, but please reconsider.

show 10 replies
maurits yesterday at 1:07 PM

For people interested in the softmax, log sum exp and energy models, have a look at "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One" [1]

[1]: https://arxiv.org/abs/1912.03263

stared yesterday at 10:07 AM

There are many useful tricks - like cosine distance.

In contrast, softmax has a very deep grounding in statistical physics - where it is called the Boltzmann distribution. In fact, this connection between statistical physics and machine learning was so fundamental that it was a key part of the 2024 Nobel Prize in Physics awarded to Hopfield and Hinton.
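
A minimal NumPy sketch of that correspondence (illustrative, with made-up logits): softmax at temperature T is exactly a Boltzmann distribution if you read the negated logits as energies.

    import numpy as np

    def softmax(z, T=1.0):
        # subtract the max for numerical stability; doesn't change the result
        z = np.asarray(z, dtype=float) / T
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    logits = np.array([2.0, 0.5, -1.0])
    energies = -logits                  # Boltzmann: p_i proportional to exp(-E_i / T)
    T = 1.0
    boltzmann = np.exp(-energies / T) / np.exp(-energies / T).sum()
    print(np.allclose(softmax(logits, T), boltzmann))  # True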

creakingstairs yesterday at 7:35 AM

Because the domain is a Korean name, I half-expected this to be about an old Korean game company[1] with the same name. They made some banger RPGs at the time and had really great art books.

[1] https://en.m.wikipedia.org/wiki/ESA_(company)

incognito124 yesterday at 11:19 AM

How to sample from a categorical: https://news.ycombinator.com/item?id=42596716

Note: I am the author

semiinfinitely yesterday at 7:11 AM

I think that log-sum-exp should actually be the function that gets the name "softmax", because it's actually a soft maximum over a set of values. And what we call "softmax" should be called "grad softmax" (since the gradient of log-sum-exp is softmax).
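
A quick numerical sanity check of this claim (plain NumPy, with finite differences standing in for autodiff; the numbers are arbitrary): log-sum-exp behaves like a smooth max, and its gradient matches softmax.

    import numpy as np

    def logsumexp(z):
        m = z.max()
        return m + np.log(np.exp(z - m).sum())

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z = np.array([3.0, 1.0, 0.2])
    print(logsumexp(z), z.max())   # ~3.18 vs 3.0 -> a "soft" maximum

    # finite-difference gradient of logsumexp is (approximately) softmax(z)
    eps = 1e-6
    grad = np.array([(logsumexp(z + eps * np.eye(3)[i]) - logsumexp(z)) / eps
                     for i in range(3)])
    print(np.allclose(grad, softmax(z), atol=1e-5))  # True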

show 1 reply
calebm yesterday at 3:47 PM

Funny timing: I just used softmax yesterday to turn a list of numbers (some of which could be negative) into a probability distribution (summing to 1). So useful; it was the perfect tool for the job.
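
For that kind of use, assuming SciPy is available, the scipy.special.softmax one-liner handles negative inputs directly (the scores below are made up):

    import numpy as np
    from scipy.special import softmax   # available since SciPy 1.2

    scores = np.array([-3.0, 0.0, 2.5, -0.1])
    p = softmax(scores)
    print(p, p.sum())   # all entries positive, sums to 1.0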

janalsncm yesterday at 9:52 AM

This is a really intuitive explanation, thanks for posting. I think everyone’s first intuition for “how do we turn these logits into probabilities” is to use a weighted sum of the absolute values of the numbers. The unjustified complexity of softmax annoyed me in college.

The author gives a really clean explanation for why that’s hard for a network to learn, starting from first principles.
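
A rough sketch of why that absolute-value intuition goes wrong, compared to softmax (the vector here is made up for illustration): normalizing by absolute values gives a strongly negative logit the same probability as a strongly positive one, and a zero logit gets exactly zero.

    import numpy as np

    def abs_normalize(z):
        a = np.abs(z)
        return a / a.sum()

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    z = np.array([2.0, -2.0, 0.0])
    print(abs_normalize(z))  # [0.5, 0.5, 0.0] -- the negative logit is treated like the positive one
    print(softmax(z))        # [~0.867, ~0.016, ~0.117] -- ordering of the logits is preserved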

AnotherGoodName yesterday at 5:35 PM

For the question "is softmax the only way to turn unnormalized real values into a categorical distribution?", you can just use statistics.

E.g., using Bayesian stats: if I assume a uniform prior (pretend I have no assumptions about how biased the coin is) and I see it flip heads 4 times in a row, what's the probability of it being heads next time?

Via a long-winded proof using the Dirichlet distribution, Bayesian stats says "add one to the top and two to the bottom". Here we saw 4/4 heads, so we guess a 5/6 chance of heads on the next flip (+1 to the top, +2 to the bottom), or a 1/6 chance of tails. This reflects the statistical model's assumption that the coin may be biased.

That's normalized as a probability against 1, which is what we want. It works for multiple outcomes as well: you add to the bottom as many different outcomes as you have. The Dirichlet distribution allows for real numbers, so you can support that too. If you feel this gives too much weight to the possibility of the coin being biased, you can simply add more to the top and bottom, which is the same as accounting for it in your prior, e.g. add 100 to the top and 200 to the bottom instead.

Now this has a lot of differences in outcomes compared to softmax. It gives everything a non-zero chance, rather than relying on the classic sigmoid-style activation underneath softmax, which pushes things toward almost absolute 0 or 1. But distributions like this are very helpful in many circumstances. Do you actually think the chance of tails becomes 0 if you see heads flipped 100 times in a row? Of course not.

So anyway, the softmax function fits things to one particular type of distribution, but you can fit pretty much anything to any distribution with good old statistics. Choose the right one for your use case.
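
A small sketch of that "add one to the top, two to the bottom" rule (Laplace's rule of succession, i.e. the posterior mean under a uniform Dirichlet prior), next to softmax applied to the same raw counts; feeding counts straight into softmax is only for illustration of how differently the two behave.

    import numpy as np

    def dirichlet_mean(counts, prior=1.0):
        # posterior mean of Dirichlet(prior, ..., prior) updated with observed counts
        counts = np.asarray(counts, dtype=float)
        return (counts + prior) / (counts.sum() + prior * len(counts))

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    heads_tails = np.array([4, 0])             # 4 heads, 0 tails
    print(dirichlet_mean(heads_tails))         # [5/6, 1/6] -- tails keeps a non-zero chance
    print(softmax(heads_tails.astype(float)))  # [~0.982, ~0.018] -- counts as logits, far more extreme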

show 1 reply
yorwba last Sunday at 8:48 AM

The author admits they "kinda stopped reading this paper" after noticing that it only used one hyperparameter configuration, which I agree is a flaw in the paper, but that's not an excuse for sloppy treatment of the rest of it. (It would, however, be an excuse to ignore it entirely.)

In particular, the assumption that |a_k| ≈ 0 initially is incorrect, since in the original paper https://arxiv.org/abs/2502.01628 the a_k are distances from one vector to multiple other vectors, and they're unlikely to be initialized in such a way that the distance is anywhere close to zero. So while the gradient divergence near 0 could certainly be a problem, it doesn't have to be as fatal as the author seems to think it is.

show 2 replies
nobodywillobsrv yesterday at 7:12 AM

Softmax’s exponential comes from counting occupation states. Maximize the ways to arrange things with logits as energies, and you get exp(logits) over a partition function, pure Boltzmann style. It’s optimal because it’s how probability naturally piles up.
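
For reference, the standard maximum-entropy derivation behind that claim, written out (textbook Lagrange-multiplier steps): maximize entropy subject to normalization and a fixed expected energy, and the stationarity condition forces the Boltzmann form.

    \max_{p}\ -\sum_i p_i \log p_i
    \quad \text{s.t.} \quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = \bar{E}

    \mathcal{L} = -\sum_i p_i \log p_i - \lambda\Big(\sum_i p_i - 1\Big) - \beta\Big(\sum_i p_i E_i - \bar{E}\Big)

    \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 - \lambda - \beta E_i = 0
    \;\Longrightarrow\;
    p_i = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}

    % with E_i = -\text{logit}_i and \beta = 1, this is exactly softmax(logits)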

show 2 replies
littlestymaar yesterday at 6:52 AM

Off topic: unlike many out there, I'm not usually bothered by a lack of capitalization in comments or tweets, but for an essay like this it makes the paragraphs so hard to read!

show 2 replies
bambax yesterday at 9:38 AM

OT: refusing to capitalize the first word of each sentence is an annoying posture that makes reading what you write more difficult. I tend to do it too when taking notes for myself, because I'm the only reader and it saves picoseconds of typing; but I wouldn't dream of inflicting it on others.

show 1 reply
xchip yesterday at 9:08 AM

The author is trying to show off; you can tell because the explanation makes no sense and is overcomplicated to look smart.