I'm guessing there's a very strong prior toward "just keep generating more tokens", as opposed to deleting code, that needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and do RL training against each individual patch committed.
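A rough sketch of the very first step, just to make the idea concrete: pull added/deleted line counts per commit out of `git log --numstat`, so you can see which patches in the history are net deletions. This is not an RL setup, only the per-patch signal you might feed into one. The names `PatchStats`, `parse_numstat` and `read_history` are made up for illustration, and the `commit:` prefix is just a marker I chose for the `--format` string.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class PatchStats:
    """Line counts for one commit (hypothetical helper type)."""
    sha: str
    added: int = 0
    deleted: int = 0

    @property
    def net(self) -> int:
        # Negative means the patch removed more lines than it added.
        return self.added - self.deleted


def parse_numstat(text: str) -> list[PatchStats]:
    """Parse the output of `git log --numstat --format=commit:%H`."""
    patches: list[PatchStats] = []
    for line in text.splitlines():
        if line.startswith("commit:"):
            patches.append(PatchStats(sha=line[len("commit:"):]))
            continue
        parts = line.split("\t")
        if len(parts) != 3 or not patches:
            continue  # blank line or anything that is not a numstat row
        added, deleted, _path = parts
        if added == "-" or deleted == "-":
            continue  # git prints "-" for binary files
        patches[-1].added += int(added)
        patches[-1].deleted += int(deleted)
    return patches


def read_history(repo: str) -> list[PatchStats]:
    """Run git in `repo` and return per-commit stats, newest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--numstat", "--format=commit:%H"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_numstat(out)
```

Something like `[p for p in read_history(".") if p.net < 0]` would then list the net-deletion commits. Whether line counts are a sensible reward at all is a separate question; it is the same lines-of-code metric people complain about, just with the sign flipped.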
Perhaps the problem is that you RL on one patch at a time and fail to capture the overarching long-term theme: an architecture change introduced gradually over many months, which exists in the maintainer’s mental model but not explicitly in the diffs.
Right, it would have to be a specialized tool that you used to analyze the codebase every now and then, or the parts that you thought should be cleaned up.
Obviously there is a "just keep generating more tokens" bias in software management too, since so many developer metrics over the years have been some form of lines-of-code analysis.
But just as experience and management practice have, over time, come to recognize this as a bad bias for ranking devs, it should be clear it is a bad bias for LLMs to have.