
notpushkin today at 2:16 AM

With a temperature of zero, LLM output will always be the same. Then it becomes a matter of getting it to output an exact replica of the input: if we can do that, it will always reproduce it, and the fact that it can also be used as a bullshit machine becomes irrelevant.

With the usual interface this is probably inefficient: a prompt alone might not produce the output we need, or it might end up larger than the thing we’re trying to compress. However, if we also steer the model’s decisions along the way, we can probably give it a small prompt to get it going, then tweak its decision process to get the tokens we want and store those tweaks alongside the prompt. (This is a very hand-wavy concept, I know.)
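Something like this, to make the hand-waving slightly more concrete (a minimal sketch, assuming deterministic greedy decoding; next_token is a stand-in for whatever model is used, not a real API):

    from typing import Callable, List, Tuple

    def compress(target: List[int], prompt: List[int],
                 next_token: Callable[[List[int]], int]) -> List[Tuple[int, int]]:
        """Store only the positions where greedy decoding disagrees with the target."""
        corrections = []
        context = list(prompt)
        for i, tok in enumerate(target):
            if next_token(context) != tok:
                corrections.append((i, tok))  # record the override
            context.append(tok)               # keep decoding from the *target* tokens
        return corrections

    def decompress(length: int, prompt: List[int],
                   corrections: List[Tuple[int, int]],
                   next_token: Callable[[List[int]], int]) -> List[int]:
        """Replay greedy decoding, applying the stored overrides."""
        overrides = dict(corrections)
        context, out = list(prompt), []
        for i in range(length):
            tok = overrides[i] if i in overrides else next_token(context)
            out.append(tok)
            context.append(tok)
        return out

The compressed representation is then (prompt, length, corrections); if the model predicts the text well, the correction list stays short.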


Replies

program_whiz today at 10:56 AM

The models are differentiable; they are trained with backprop, so you can run the same machinery in reverse to find an input that makes the output near-certain. For a given sequence length, you set up a new optimization: take the input sequence, pass it through the (frozen) model, and run gradient steps on the input sequence to reduce a "loss" that measures how far you are from the desired output. This gives you the optimal sequence of that length for maximizing the probability of the output sequence. Of course, if you're doing this to ChatGPT or another API-only model, you have no access to gradients and no choice but to hunt around.

Of course, the optimal input will be a sequence of embedding vectors (each with hundreds of dimensions). You could snap each one to its closest token in the vocabulary (or make that a constraint during solving), or just use the vectors themselves as the compressed data.
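Roughly like this (a hand-rolled sketch in PyTorch against a frozen GPT-2, not production code; prompt_len, the learning rate, and the step count are arbitrary choices):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.requires_grad_(False)  # freeze the model; only the input vectors get gradients

    target_ids = tok("the text we want the model to emit", return_tensors="pt").input_ids
    target_emb = model.get_input_embeddings()(target_ids)           # (1, T, d)

    prompt_len = 8
    soft_prompt = torch.randn(1, prompt_len, target_emb.size(-1), requires_grad=True)
    opt = torch.optim.Adam([soft_prompt], lr=0.1)

    for step in range(500):
        inputs = torch.cat([soft_prompt, target_emb], dim=1)         # prompt ++ target
        labels = torch.cat([torch.full((1, prompt_len), -100, dtype=torch.long),
                            target_ids], dim=1)                      # no loss on the prompt part
        loss = model(inputs_embeds=inputs, labels=labels).loss       # cross-entropy on target tokens
        opt.zero_grad(); loss.backward(); opt.step()

After training, soft_prompt is the continuous "input" that (approximately) maximizes the probability of the target under the frozen model; you can snap it to the nearest token embeddings or store the raw vectors, as described above.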

Ultimately, neural nets of various kinds are already used for compression in various contexts. There are examples where Gaussian-splatting-like 3D scenes are created by compressing all the data into the weights of a network, via a process similar to the one I described, yielding a fully explorable color scene that can be rendered from any angle.

microtonal today at 10:49 AM

A bit of nitpicking: a temperature of zero does not really exist (it would lead to division by zero in the softmax). It's sampling (and non-deterministic compute kernels) that makes token prediction non-deterministic. You could simply fix that (assuming deterministic kernels) by using greedy decoding (argmax, with a stable sort in the case of ties).

As the temperature approaches zero, the probability of the most likely token approaches one (assuming no ties). So my guess is that LLM inference providers started treating temperature=0 as "disable sampling" because people would try to approximate greedy decoding with teensy temperatures.
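To make the arithmetic concrete (just a toy NumPy illustration, nothing provider-specific):

    import numpy as np

    def sample(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
        z = logits / temperature              # undefined at temperature == 0
        p = np.exp(z - z.max())
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))

    def greedy(logits: np.ndarray) -> int:
        return int(np.argmax(logits))         # the T -> 0 limit: no division, no randomness

    logits = np.array([2.0, 1.5, 0.1])
    rng = np.random.default_rng(0)
    print([sample(logits, 0.01, rng) for _ in range(5)])   # almost always token 0
    print(greedy(logits))                                   # always token 0

At T=0.01 the scaled logits already differ by 50, so the top token's probability is indistinguishable from one; argmax just skips the detour.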

duskwuff today at 3:11 AM

There's an easier and more effective way of doing that: instead of trying to find an extrinsic prompt that makes the model respond with your text, you feed the text in as input and, for each token, encode the rank of the actual token within the set of tokens the model could have produced at that point (or an escape code for tokens that were completely unexpected). If you're feeling really crafty, you can even use arithmetic coding based on the probabilities of each token, so that encoding high-probability tokens takes fewer bits.

From what I understand, this is essentially how ts_zip (linked elsewhere) works.
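A bare-bones sketch of the rank idea (not ts_zip's actual code; GPT-2 via transformers, first token stored literally, deterministic compute assumed):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def encode(text: str) -> list[int]:
        ids = tok(text, return_tensors="pt").input_ids[0]
        code = [int(ids[0])]                  # first token stored literally
        with torch.no_grad():
            for i in range(1, len(ids)):
                logits = model(ids[:i].unsqueeze(0)).logits[0, -1]
                order = torch.argsort(logits, descending=True)   # model's ranking of the vocab
                code.append(int((order == ids[i]).nonzero()))    # rank of the true token
        return code

    def decode(code: list[int]) -> str:
        ids = torch.tensor([code[0]])
        with torch.no_grad():
            for rank in code[1:]:
                logits = model(ids.unsqueeze(0)).logits[0, -1]
                order = torch.argsort(logits, descending=True)
                ids = torch.cat([ids, order[rank].unsqueeze(0)]) # invert the rank
        return tok.decode(ids)

If the model predicts the text well, the ranks are mostly zeros and ones, which is exactly what an entropy coder (like the arithmetic coding mentioned above) squeezes down to a handful of bits per token.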

D-Machine today at 6:10 AM

> With a temperature of zero, LLM output will always be the same

Ignoring GPU indeterminism, if you are running a local LLM and control batching, yes.

If you are computing via API / on the cloud, and so being batched with other computations, then no (https://thinkingmachines.ai/blog/defeating-nondeterminism-in...).

But, yes, there is a lot of potential for semantic compression via AI models here, if we just put in the effort.