Hacker News

nemo1618 · today at 1:19 AM

> But like humans — and unlike computer programs — they do not produce the exact same results every time they are used. This is fundamental to the way that LLMs operate: based on the "weights" derived from their training data, they calculate the likelihood of possible next words to output, then randomly select one (in proportion to its likelihood).

This is emphatically not fundamental to LLMs! Yes, the next token is selected randomly; but "randomly" could mean "chosen using an RNG with a fixed seed." Indeed, many APIs used to support a "temperature" parameter that, when set to 0, would result in fully deterministic output. These parameters were slowly removed or made non-functional, though, and the reason has never been entirely clear to me. My current guess is that it is some combination of A) 99% of users don't care, B) perfect determinism would require not just a seeded RNG, but also fixing a bunch of data races that are currently benign, and C) deterministic output might be exploitable in undesirable ways, or lead to bad PR somehow.
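To illustrate the point, here is a toy sampler (names and numbers are illustrative, not any real API): greedy temperature-0 decoding is deterministic with no RNG at all, and even nonzero-temperature sampling is reproducible once the RNG is seeded.

```python
import random

# Toy next-token distribution; a real model computes this from logits
# via softmax. All names here are illustrative.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.2}

def sample_token(probs, temperature, rng):
    if temperature == 0:
        # Greedy decoding: always pick the most likely token. Deterministic,
        # no randomness involved at all.
        return max(probs, key=probs.get)
    # Temperature rescaling: p^(1/T) is equivalent to dividing logits by T
    # before softmax, up to renormalization.
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    r = rng.random() * total
    acc = 0.0
    for token, weight in scaled.items():
        acc += weight
        if r <= acc:
            return token
    return token  # guard against floating-point rounding at the boundary

# With a fixed seed, even temperature > 0 is fully reproducible.
rng1 = random.Random(42)
rng2 = random.Random(42)
run1 = [sample_token(probs, 0.8, rng1) for _ in range(10)]
run2 = [sample_token(probs, 0.8, rng2) for _ in range(10)]
assert run1 == run2

# Temperature 0 ignores the RNG entirely.
assert sample_token(probs, 0, rng1) == "cat"
```

So "random" at the sampling step is a policy choice, not something inherent to the model.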


Replies

pavpanchekha · today at 1:32 AM

Deterministic output is incompatible with batching, which in turn is critical to high utilization on GPUs, which in turn is necessary to keep costs low.
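The mechanism, roughly: floating-point addition is not associative, so the same reduction performed in a different order (as happens when a request lands in differently composed batches) can differ in the last bits, and a near-tied argmax between two tokens can flip on that difference. A minimal sketch of the non-associativity itself:

```python
# Floating-point addition is not associative: summing the same values in a
# different order can give a different result. The values below are chosen
# to make the effect obvious.
vals = [1e16, 1.0, -1e16, 1.0]

# Left-to-right: the first 1.0 is absorbed by 1e16 (below its rounding ulp).
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]

# Reordered: the large terms cancel first, so both 1.0s survive.
reordered = ((vals[0] + vals[2]) + vals[1]) + vals[3]

assert left_to_right == 1.0
assert reordered == 2.0
assert left_to_right != reordered
```

In a real serving stack the reductions happen inside batched GPU matrix multiplies, where the summation order depends on batch shape and kernel choice, so even greedy decoding isn't bitwise stable across batches.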

valenterry · today at 5:18 AM

> This is emphatically not fundamental to LLMs! Yes, the next token is selected randomly; but "randomly" could mean "chosen using an RNG with a fixed seed."

This. Thanks for saying that, because now I don't need to read the article, since if the author doesn't even get that, I'm not interested in the rest.

jrmg · today at 3:37 AM

LLMs are, fundamentally, compressed lookup tables that map input -> input + next token. Or, if you like, input -> input + a list of possible next tokens with probabilities.
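A toy, uncompressed version of that view: a literal table from context to next-token logits, turned into a probability distribution via softmax. (Everything here is illustrative; a real LLM replaces the table with a network that also generalizes to contexts it never saw.)

```python
import math

# Literal "lookup table": context tuple -> logits for candidate next tokens.
# Entries are made up for illustration.
table = {
    ("the", "cat"): {"sat": 2.0, "ran": 1.0, "is": 0.5},
}

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

probs = softmax(table[("the", "cat")])
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a valid distribution
assert max(probs, key=probs.get) == "sat"     # highest-logit token wins
```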

willj · today at 4:05 AM

The temperature parameters largely went away when we moved toward reasoning models, which emit many reasoning tokens before the actual output tokens. I don't know whether it was found that reasoning works better at a higher temperature, or that having separate temperatures for reasoning vs. output wasn't practical, but that's my observation of the timing, anyway. And to the other commenter's point, even a temperature of 0 is not deterministic unless the batches are invariant, which they're not in production workloads.