Let's be real: if you and I both ask Claude to generate a feature on the same project, what are the chances that it spits out identical code? But if we both build the project from a Dockerfile, we get the same binary and the same image. Products built around LLMs are nondeterministic, unlike compilers.
It's nondeterministic because we chose that by setting a higher 'temperature'. I bet if you run an open-weights model at temperature 0, with the same prompt on the same device and with parallelism turned off, you will get a much more deterministic result (barring some floating-point effects).
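To make the temperature point concrete, here is a minimal sketch in plain Python, not how any real inference stack is implemented. The `sample_token` helper and the logit values are made up for illustration. It shows that temperature 0 reduces to a deterministic argmax, that higher temperature draws from a distribution, and why floating-point math can still cause differences when the order of operations changes (as it can under parallelism).

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper)."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with temperature: higher values flatten the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

# Temperature 0: the RNG is never consulted, so every run agrees.
greedy = {sample_token(logits, 0, random.Random()) for _ in range(100)}

# Temperature 1: the result depends on the random draw.
sampled = [sample_token(logits, 1.0, random.Random(seed)) for seed in range(100)]

# Floating-point addition is not associative, so summing the same
# numbers in a different order can give a different result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(greedy)         # a single token index
print(set(sampled))   # typically several different indices
print(left == right)  # False
```

In a real model, a tiny difference like the last one can flip a near-tie between two tokens, which is why temperature 0 alone does not guarantee identical output.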
I can assure you that a fully deterministic and equally effective Claude is possible to build. And yes, that would mean identical prompts would yield 100% identical output 100% of the time. It would still make the occasional logical or factual error, but it would do so deterministically. Would this solve any of the problems with building reliable programs using LLMs?