> so in principle, setting temperature to 0 _should_ result in deterministic outputs
It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.
Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.
But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.
> GPUs put the associativity of the sums in matrix multiplications in arbitrary order
That’s user-controlled too, not an inherent property of GPUs:
https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
You don't have to sample uniformly. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it