LLMs are not deterministic:
1.) There is typically a temperature setting (even when not exposed, most major providers have stopped exposing it [esp in the TUIs]).
2.) Then, even with the temperature set to 0, it will be almost deterministic but you'll still observe small variations due to the limited precision of float numbers.
Edit: thanks for the corrections
Limited precision of float numbers is deterministic. But there's whole parallelism and how things are wired together, your generation may end up on a different hardware etc.
And models I work with (claude,gemini etc) have the temperature parameter when you are using API.
You shouldn't be downvoted - LLMs could in theory be deterministic, but they currently are not, due to how models are implemented.
> but you'll still observe small variations due to the limited precision of float numbers
No. Floating number arithmetic is deterministic. You don't get different answers for the same operations on the same machine just because of limited precision. There are reasons why it can be difficult to make sure that floating point operations agree across machines, but that is more of a (very annoying and difficult to make consistent) configuration thing than determinism.
(In general it is mildly frustrating to me to see software developers treat floating point as some sort of magic and ascribe all sorts of non-deterministic qualities to it. Yes floating point configuration for consistent results across machines can be absurdly annoying and nigh-impossible if you use transcendental functions and different binaries. No this does not mean if your program is giving different results for the same input on the same machine that this is a floating point issue).
In theory parallel execution combined with non-associativity can cause LLM inference to be non-deterministic. In practice that is not the case. LLM forward passes rarely use non-deterministic kernels (and these are usually explicitly marked as such e.g. in PyTorch).
You may be thinking of non-determinism caused by batching where different batch sizes can cause variations in output. This is not strictly speaking non-determinism from the perspective of the LLM, but is effectively non-determinism from the perspective of the end user, because generally the end user has no control over how a request is slotted into a batch.