Can someone explain why this is? Do LLMs somehow contain a true random number generator? Why wouldn't they produce the same outputs given the same inputs (even with the same temperature)?
edit: I'm not talking about an LLM as accessed through a provider. I'm just talking about using a model directly. Why wouldn't that be deterministic?
The model outputs a probability distribution for the next token, given the sequence of all previous tokens in the context window. It’s just a list of floats in the same order as the list of tokens that the tokenizer uses.
After that, a piece of software that is NOT the LLM chooses the next token. This is called the sampler. There are different sampling parameters and strategies available, but if you want repeatable* outputs, just take the token with the highest probability number.
* Perfect determinism in this sense is difficult to achieve because GPU calculations naturally have a minor bit of nondeterminism. But you can get very close.
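A toy sketch of that model/sampler split in Python (made-up logits for a 4-token vocabulary, not any real library's API):

```python
import numpy as np

# What the "model" produces: one logit per vocabulary entry; softmax turns them
# into the probability list described above, indexed by token id.
def softmax(logits):
    z = logits - np.max(logits)           # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.5])  # toy logits for a 4-token vocabulary
probs = softmax(logits)                   # probs[i] = probability of token id i
next_token = int(np.argmax(probs))        # greedy sampler: take the most likely
print(probs, next_token)                  # -> token id 0, same answer every run
```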
An LLM itself -- that is, the weights and the mathematical functions linking them -- does not tell you exactly how to train it from data, nor how to generate an output. Instead, it describes a function giving a relative likelihood(output | input).
Deciding how to pick a particular output given that likelihood function is left as an exercise for the user, which we call inference.
One obvious choice is to keep picking the highest-likelihood token, feed it back into the model, and get another -- on repeat. This is what most algorithms call "temperature=0". But doing this token after token can lead to boring output, or steer you into pathological low-probability sequences like a set of endless repeats.
So, the current SOTA is to intentionally introduce a random factor (temperature>0) to the sampling process -- along with other hacks, like explicit suppression of repeats.
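Roughly what temperature does in the sampler, as a toy Python sketch (made-up logits; the anti-repeat hacks would be extra adjustments to the logits before this step):

```python
import numpy as np

rng = np.random.default_rng()

def sample(logits, temperature=1.0):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0:                         # "temperature=0": plain greedy
        return int(np.argmax(logits))
    scaled = logits / temperature                # flatten or sharpen the distribution
    scaled -= scaled.max()                       # shift for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))  # random draw from the distribution

logits = [2.0, 1.0, 0.1, -1.5]
print(sample(logits, temperature=0))     # always token 0
print(sample(logits, temperature=1.2))   # usually token 0, sometimes not
```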
Yeah, sure. Temperature is baked into these LLM inference setups, and when it isn't zero it increases the chance of taking a different path when decoding tokens, whether that's at a provider or a model downloaded on your own machine.
Technically, even when the temperature is 0 it's not guaranteed to be deterministic, though it usually is: you can have ties in the probabilities for the next token, and floating point noise is real.
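A toy illustration of the floating point part (plain numpy, nothing GPU-specific): adding the same numbers in a different order, which is roughly what a change in GPU reduction order does, can produce a slightly different total, and that is enough to flip a near-tie at temperature 0.

```python
import numpy as np

xs = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
forward = xs.sum()                 # one accumulation order
backward = xs[::-1].sum()          # same values, reversed accumulation order
print(forward, backward, bool(forward == backward))   # often not exactly equal
```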
All these models are doing is guesstimating the next token to say.
There are A LOT of misconceptions about LLMs, and the biggest one is that they are not deterministic. They are 100% deterministic, and temperature has nothing to do with it. You WILL get exactly the same result every single time (at ANY temperature) as long as you use the same sampling parameters and server config parameters.

What causes variance in LLMs is server parameters like batch processing and caching, among a few other things, with batching responsible for most of the issues. The reason batching is used is that large providers serve multiple customers per GPU, and breaking up the VRAM is tricky and causes drift. If you start llama.cpp, for example, with only one person per slot and batching off, you will always get the same results every time, even at temperature 1.2 or whatever other parameters, because you are using one GPU per inference call, so no fucky business there.

The reason most people are unaware of this is that most people only have experience with an API instead of working with the actual inference engine itself, so this goddamned myth keeps spreading. My video for reference, where you can download and try it for yourself: https://www.youtube.com/watch?v=EyE5BrUut2o
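The reproducibility claim in miniature, as a toy Python sketch (made-up logits, numpy's seeded RNG standing in for the engine's sampler; llama.cpp exposes a seed setting for the same reason):

```python
import numpy as np

def run(seed, temperature=1.2):
    rng = np.random.default_rng(seed)              # fixed seed = fixed draws
    logits = np.array([2.0, 1.0, 0.1, -1.5]) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return [int(rng.choice(len(probs), p=probs)) for _ in range(10)]

print(run(42) == run(42))   # True: same seed + same sampling params -> same tokens
print(run(42) == run(43))   # almost certainly False: different seed, different draws
```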