Hacker News

greesil today at 3:34 AM

Wtf is a policy? Is this some sort of RL thing that I'm too ML to understand?

Gemini tells me it's the probability of the next token for an LLM. Okay then.


Replies

Ifkaluva today at 4:46 AM

It’s quite common these days to treat an LLM as a policy: the “state” is the previous context, and the “action” is the continuation the model chooses. It gets a “reward” either from a reward model that was trained on human preferences, or from a verifiable source, such as passing test cases.

This framing has been in use for several years, as it’s the framing that enables RLHF and RLVR. RLHF itself is quite old; I think it has been used for LLMs since at least the original ChatGPT.
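
In case a concrete version helps, here is a minimal Python sketch of the state/action/reward framing. Everything in it is made up for illustration: the three candidate continuations, the `add` test case, and the single-state softmax "policy" are assumptions, and a real LLM samples token by token rather than picking from a fixed list. The update rule is plain REINFORCE, which I'm using as a stand-in, not claiming it's what any particular RLHF or RLVR pipeline runs.

```python
import math
import random

# Hypothetical toy setup: the "state" is one fixed prompt (complete the body
# of add(a, b)), and each "action" is a whole candidate continuation.
CANDIDATES = ["return a + b", "return a - b", "return a * b"]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(logits):
    """pi(action | state): one probability per candidate continuation."""
    return softmax(logits)


def verifiable_reward(action):
    """Reward 1.0 if the completed function passes the test case, else 0.0."""
    namespace = {}
    exec("def add(a, b):\n    " + action, namespace)
    return 1.0 if namespace["add"](2, 3) == 5 else 0.0


def train(steps=500, lr=0.5, seed=0):
    """Sample an action, score it, and nudge the policy toward rewarded actions."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)  # only one state, so one logit per action
    for _ in range(steps):
        probs = policy(logits)
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        reward = verifiable_reward(CANDIDATES[i])
        # REINFORCE: d log pi(i) / d logit_j = (1 if j == i else 0) - probs[j]
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return policy(logits)


if __name__ == "__main__":
    for action, p in zip(CANDIDATES, train()):
        print(f"{p:.3f}  {action}")
```

Swapping `verifiable_reward` for a learned scorer trained on human preference data would give the RLHF flavor of the same loop; the policy side stays the same.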

mountainriver today at 3:42 AM

What is this comment? It’s an RL paper; these are standard RL terms.
