Hacker News

Retr0id, yesterday at 10:45 PM

> RLVR is weirder, and I suspect it's why we see "It's not X, it's Y" so often.

This feels like an easy enough hypothesis to verify, for anyone in the business of training LLMs - does the not-X-but-Y rate increase after RLVR?
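A rough sketch of how that rate could be measured, assuming you have text samples from checkpoints before and after RLVR. The regex is a crude heuristic and the function name is hypothetical, not taken from any existing tool; it will miss many phrasings and catch some false positives.

```python
import re

# Crude, hypothetical heuristic for "not X, it's Y" / "not X but Y" phrasings.
# X is limited to 60 characters and must not cross sentence punctuation.
NOT_X_BUT_Y = re.compile(
    r"\bnot\s+[^.;!?,]{1,60}(?:,\s*(?:but|it['’]s|it is)\b|\s+but\b)",
    re.IGNORECASE,
)


def not_x_but_y_rate(samples):
    """Return the fraction of samples containing at least one match."""
    if not samples:
        return 0.0
    hits = sum(1 for text in samples if NOT_X_BUT_Y.search(text))
    return hits / len(samples)


# Toy strings for illustration only, not real model output.
before = ["The sky is blue.", "I do not know."]
after = ["It's not a bug, it's a feature.", "Fine."]

print(not_x_but_y_rate(before), not_x_but_y_rate(after))
```

Comparing the two rates on outputs for the same prompts would be the simplest version of the check; a real evaluation would need a much larger sample and a better detector than a single regex.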


Replies

andy99, yesterday at 11:01 PM

It’s unlikely this is true. LLMs are far more mad-libs / template-driven than we like to admit. That’s (ironically) not a judgement about their capability, it’s primarily just an observation. But it’s also what plain old SFT, which I believe is the primary culprit, ends up imparting.