aesthesia • today at 5:50 AM • 5 replies

A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.

Replies

317070 • today at 6:21 AM

> so in principle, setting temperature to 0 _should_ result in deterministic outputs

It is a common misconception, but it is not true even in principle. If two or more logits are equal to the maximum logit, I will sample uniformly at random among them at any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly between the first and the last element.
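To make the tie case concrete, here is a minimal Python sketch. The function names `softmax_with_temperature` and `greedy_limit` are illustrative, not from any library, and the sketch assumes the sampler really follows the mathematical limit of the softmax. An implementation that instead takes an argmax with a fixed tie-break would behave differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_limit(logits):
    """Limit of the softmax as temperature -> 0: uniform over the tied maxima."""
    m = max(logits)
    ties = [i for i, x in enumerate(logits) if x == m]
    return [1.0 / len(ties) if i in ties else 0.0 for i in range(len(logits))]

logits = [1.0, 0.0, 1.0]
for t in (1.0, 0.1, 0.01):
    # The middle entry shrinks toward 0; the two tied entries approach 0.5 each.
    print(t, softmax_with_temperature(logits, t))
print("limit:", greedy_limit(logits))
```

As the temperature shrinks, the mass on the middle element vanishes, but the two tied elements never separate, so a sampler drawing from this distribution still has a coin flip to make.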

Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs perform the sums inside a matrix multiplication in an arbitrary order, and since floating-point addition is not associative, this has a huge impact on the logits coming out of the neural network.
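The order dependence of floating-point sums can be shown on a CPU with plain Python. This is a deliberately exaggerated sketch: the values are chosen to make the effect obvious, `sum_in_order` is a hypothetical helper, and real GPU kernels produce far smaller discrepancies, which matter mainly when two logits are nearly tied.

```python
def sum_in_order(values):
    """Add the values strictly left to right, as one particular reduction order would."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers, summed in two different orders.
order_a = sum_in_order([1e16, -1e16, 1.0])  # (1e16 - 1e16) + 1.0 = 1.0
order_b = sum_in_order([1e16, 1.0, -1e16])  # 1e16 + 1.0 rounds back to 1e16, so the result is 0.0
print(order_a, order_b)

def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])

# If that sum were one logit competing with a second logit of 0.5,
# the reduction order alone would decide which token wins.
print(argmax([order_a, 0.5]), argmax([order_b, 0.5]))
```

An explicit loop is used instead of the built-in `sum` because recent Python versions apply compensated summation to floats, which would hide the effect.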

easygenes • today at 6:12 AM

There are. If the kernels are nondeterministic (e.g. because of timing issues), there are minor changes between runs on a single system, even with greedy decoding enabled (typically what temperature=0 achieves).

IshKebab • today at 6:13 AM

Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.

croes • today at 7:24 AM

So you would always get the same result, but it could be the wrong one.

valzam • today at 5:59 AM

I mean the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. A higher temperature just means that the probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2.
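A rough Python sketch of that scenario, assuming a harness that samples from the top-k distribution as given rather than collapsing it to the single most likely token. Whether any particular harness does this at temperature 0 is an assumption, and `sample_top_k` and the 0.01 probability for T3 are made up for illustration.

```python
import random
from collections import Counter

def sample_top_k(probs, k, rng):
    """Keep the k most likely tokens, renormalize, and draw one of them."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    weights = [p for _, p in top]  # random.choices normalizes the weights itself
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"T1": 0.8, "T2": 0.19, "T3": 0.01}
rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(sample_top_k(probs, k=2, rng=rng) for _ in range(1000))
print(counts)  # T1 dominates, T2 still appears, T3 is cut off by k=2
```

With k=2, T3 can never be drawn, but T2 is still picked in roughly a fifth of the draws, so the output varies from run to run unless the seed is fixed.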
