Hacker News

antonvs, today at 8:11 AM

> But the probability vector is the output of the LLM, no?

In some contexts yes, but that's not actually the policy. As I wrote in my other comment (quoting because I think it's worth highlighting):

> "the policy is a function that, given some context, assigns probabilities to possible next tokens."

In the same sentence, I also referred to this as a "probability distribution", but that's not accurate: the policy is a function that produces a probability distribution. Evaluated at a specific context, it yields one distribution over possible next tokens.
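To make the function-versus-distribution point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the three-word vocabulary, the `toy_policy` name, and the scoring rule are assumptions, not how any real LLM computes its scores. The only point is that the policy is the callable, and a distribution only appears once you apply it to a context.

```python
import math
from typing import Callable, Dict, Sequence

# A distribution maps each candidate next token to a probability.
Distribution = Dict[str, float]
# A policy is a function from a context to such a distribution.
Policy = Callable[[Sequence[str]], Distribution]

VOCAB = ["cat", "dog", "fish"]  # hypothetical toy vocabulary


def toy_policy(context: Sequence[str]) -> Distribution:
    """Hypothetical policy: scores depend only on the context length."""
    # Made-up scoring rule standing in for a real model's forward pass.
    scores = [float((len(context) + i) % 3) for i in range(len(VOCAB))]
    # Softmax turns the scores into probabilities that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(VOCAB, exps)}


policy: Policy = toy_policy            # the policy: one function
dist_a = policy(["the"])               # a distribution for one context
dist_b = policy(["the", "big"])        # a different distribution for another
```

Here `policy` is a single object, while `dist_a` and `dist_b` are two of the many distributions it can produce.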

In fact, you'd be closer to the mark if you called the policy "the model", but the two terms emphasize different aspects: as I said, "policy" views it from an RL perspective. From that perspective, the policy is a function, and the model is an implementation of that function.

Besides, "output of the LLM" is ambiguous. It commonly means the actual generated token(s) (or text), not the probability distribution. Depending on context, "output of the LLM" could refer to (1) logits, (2) the probability distribution, (3) a single selected token, (4) the full generated text.
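The four meanings can be seen side by side in a small Python sketch. It's a toy under stated assumptions: `toy_logits` is a hypothetical stand-in for a model's forward pass, the vocabulary is invented, and sampling is plain categorical sampling rather than whatever decoding strategy a given system actually uses.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]  # hypothetical toy vocabulary


def toy_logits(context):
    """Hypothetical stand-in for a forward pass: one raw score per token."""
    # The end-of-sequence score grows with context length (made-up rule).
    return [1.0, 0.5, 0.2 * len(context)]


def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(probs, rng):
    """Draw one token from the distribution."""
    return rng.choices(VOCAB, weights=probs, k=1)[0]


def generate(context, rng, max_tokens=10):
    """Sample tokens repeatedly until <eos> or the length limit."""
    tokens = list(context)
    for _ in range(max_tokens):
        tok = sample_token(softmax(toy_logits(tokens)), rng)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return " ".join(tokens)


rng = random.Random(0)
logits = toy_logits(["a"])        # (1) logits
probs = softmax(logits)           # (2) the probability distribution
token = sample_token(probs, rng)  # (3) a single selected token
text = generate(["a"], rng)       # (4) the full generated text
```

All four of `logits`, `probs`, `token`, and `text` could reasonably be called "the output" here, which is exactly the ambiguity.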

"Policy" has no such ambiguity - it has a precise definition. That's why technical subjects rely on jargon in the first place, but it results in the exact issue we're running into here: "Jargon enables quick and precise communication among insiders, but it is usually confusing or unintelligible to outsiders."