I assume it's a lack of care when RLing them.
RL has a tendency to reinforce cheating when the cheats are easier to find than the intended solution.
So when making your RL environment, you need to spend a lot of effort on finding ways the model can cheat and penalizing them.
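
To make that concrete, here's a rough sketch of what "penalizing cheats" could look like in a reward function for a coding task. Everything in it is made up for illustration: the `Episode` fields, the two cheat flags, and the penalty value are assumptions, not anyone's actual setup, and a real environment would need actual detectors behind those flags.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical summary of one rollout on a coding task."""
    tests_passed: int
    tests_total: int
    modified_test_files: bool  # assumed output of a cheat detector
    hardcoded_outputs: bool    # assumed output of a cheat detector


# Assumed value; the point is only that cheating must score worse than failing.
CHEAT_PENALTY = 1.0


def reward(ep: Episode) -> float:
    """Return the pass rate, or a negative reward if a known cheat was detected."""
    if ep.modified_test_files or ep.hardcoded_outputs:
        return -CHEAT_PENALTY
    if ep.tests_total == 0:
        return 0.0
    return ep.tests_passed / ep.tests_total


honest = Episode(tests_passed=3, tests_total=4,
                 modified_test_files=False, hardcoded_outputs=False)
cheater = Episode(tests_passed=4, tests_total=4,
                  modified_test_files=True, hardcoded_outputs=False)
print(reward(honest), reward(cheater))
```

The cheater passes every test but still scores below the honest partial solution. This only covers cheats you already know to look for, which is why finding them takes so much of the effort.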