Their R1 paper was really well done. But I think it leaves out a few details necessary for stable training.
https://cameronrwolfe.substack.com/p/grpo-tricks