I think I've heard multiple times that a large share of training compute for SoTA models is inference to generate training tokens. This is bound to happen with RL training, since the rollouts have to be sampled from the model before it can learn from them.