Hacker News

mrtesthah, yesterday at 9:11 PM

>"is the RLHF judge happy with the answer."

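Reinforcement Learning with Verifiable Rewards (RLVR), as used to improve math and coding success rates, seems like an exception: the reward there comes from an automatic check of the result, not from a judge's preference.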
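
A minimal sketch of verifiable rewards, assuming exact-match grading for math and input/expected test cases for code. The names `math_reward` and `code_reward` are hypothetical and not taken from any particular library or training setup:

```python
from typing import Callable, Iterable, Tuple


def math_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the final answers match after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def code_reward(candidate: Callable[[int], int],
                cases: Iterable[Tuple[int, int]]) -> float:
    """Return 1.0 only if the candidate passes every (input, expected) case."""
    try:
        return 1.0 if all(candidate(x) == want for x, want in cases) else 0.0
    except Exception:
        # A candidate that crashes simply earns no reward.
        return 0.0


print(math_reward(" 42 ", "42"))                       # 1.0
print(code_reward(lambda x: x * 2, [(1, 2), (3, 6)]))  # 1.0
print(code_reward(lambda x: x + 1, [(1, 2), (3, 6)]))  # 0.0
```

Both functions are binary and deterministic, so there is no judge to please; real setups would presumably need more careful answer normalization and sandboxed execution.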