greesil • today at 3:34 AM • 2 replies

Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.

Replies

It’s quite common these days to treat an LLM as a policy: it takes the previous context as its “state”, and its task is to choose a continuation as its “action”. It then gets a “reward” from a reward model trained on human preferences, or from a verifiable source such as passing test cases.
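A rough sketch of that framing in Python, in case it helps. Everything here is a toy assumption of mine: `TOY_LOGITS`, `policy`, `sample_action`, and `verifiable_reward` are made-up names, and a lookup table stands in for the neural network that a real LLM would use to score continuations.

```python
import math
import random

# Hypothetical toy setup: the "state" is the prompt so far, and each
# "action" is one candidate continuation. A lookup table stands in for
# the network that would normally produce the logits.
TOY_LOGITS = {
    "2 + 2 =": {" 3": 0.5, " 4": 2.0, " 5": 0.1},
}


def policy(state):
    """pi(action | state): softmax over the toy logits for this state."""
    logits = TOY_LOGITS[state]
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {a: math.exp(v - top) for a, v in logits.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def sample_action(state, rng):
    """Choose a continuation by sampling from the policy's distribution."""
    probs = policy(state)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


def verifiable_reward(state, action):
    """Verifiable reward: 1.0 if the continuation is the correct answer."""
    return 1.0 if action.strip() == "4" else 0.0


rng = random.Random(0)
state = "2 + 2 ="
action = sample_action(state, rng)
reward = verifiable_reward(state, action)
print(policy(state), repr(action), reward)
```

An RL training loop would then adjust the logits so that high-reward continuations become more likely. That update step is left out here, and a reward model trained on human preferences would simply replace `verifiable_reward`.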

This framing has been in use for several years, since it’s what enables RLHF and RLVR. RLHF itself is fairly old: it was used to train the original ChatGPT, and the technique predates it.

mountainriver • today at 3:42 AM

What is this comment? It’s an RL paper; these are standard RL terms.
