I'm guessing there's a very strong prior toward "just keep generating more tokens" rather than deleting code, and that prior needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and do RL training against each individual patch committed.
I think this is already in the training data, since they use commit data from repos, but I imagine code deletions are rarer than they should be in the real data as well.
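
If anyone wants to check how rare net-deletion commits actually are in a given repo, here is a rough sketch that tallies added vs. deleted lines per commit from `git log --numstat`. The names (`PatchStats`, `parse_numstat`, `repo_patch_stats`, `net_deletion_fraction`) and the `@@` commit marker are my own for illustration, and "more lines deleted than added" is only a crude proxy for "this patch removed code".

```python
import subprocess
from dataclasses import dataclass


@dataclass
class PatchStats:
    """Line counts for one commit (hypothetical helper type)."""
    commit: str
    added: int = 0
    deleted: int = 0

    @property
    def net_deletion(self) -> bool:
        # Crude proxy: the commit removed more lines than it added.
        return self.deleted > self.added


def parse_numstat(log_text: str) -> list[PatchStats]:
    """Parse output of `git log --numstat --format=@@%H`."""
    stats: list[PatchStats] = []
    for line in log_text.splitlines():
        if line.startswith("@@"):
            # Commit header line produced by our custom --format string.
            stats.append(PatchStats(commit=line[2:]))
            continue
        parts = line.split("\t")
        if len(parts) != 3 or not stats:
            continue  # blank separator lines, or anything unexpected
        added, deleted, _path = parts
        if added == "-" or deleted == "-":
            continue  # git prints "-" for binary files; skip them
        stats[-1].added += int(added)
        stats[-1].deleted += int(deleted)
    return stats


def repo_patch_stats(repo_path: str) -> list[PatchStats]:
    """Run git on a local clone and return per-commit line counts."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", "--no-merges",
         "--numstat", "--format=@@%H"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_numstat(out)


def net_deletion_fraction(stats: list[PatchStats]) -> float:
    """Fraction of commits that deleted more lines than they added."""
    if not stats:
        return 0.0
    return sum(s.net_deletion for s in stats) / len(stats)
```

Running `net_deletion_fraction(repo_patch_stats("path/to/clone"))` on a local clone would give a first-order number to compare against how often a model's patches come out net-negative. The same per-commit records could also be used to pick which patches to train against, though that part is speculation on my end.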