Bring up passage that supports your claim. I'll wait.

measurablefunc • yesterday at 8:05 PM • 1 reply • view on HN

Replies

Not exactly sure what you are looking for here.

That GRPO works?

> Group Relative Policy Optimization (GRPO), a variant reinforcement learning (RL) algorithm of Proximal Policy Optimization (PPO) (Schulman et al., 2017). GRPO foregoes the critic model, instead estimating the baseline from group scores, significantly reducing training resources. By solely using a subset of English instruction tuning data, GRPO obtains a substantial improvement over the strong DeepSeekMath-Instruct, including both in-domain (GSM8K: 82.9% → 88.2%, MATH: 46.8% → 51.7%) and out-of-domain mathematical tasks (e.g., CMATH: 84.6% → 88.8%) during the reinforcement learning phase

Page 2 of https://arxiv.org/pdf/2402.03300

That GRPO on code works?

> Similarly, for code competition prompts, a compiler can be utilized to evaluate the model’s responses against a suite of predefined test cases, thereby generating objective feedback on correctness

Page 4 of https://arxiv.org/pdf/2501.12948

➕ show 1 reply

alt Hacker News

Replies