> No, all these models are just bad for anything that they weren't RLed for, and decent for things they were
Are you claiming that the models are RLed to intentionally add errors to our programs when you use them, or what's the argument you're trying to make here? Otherwise I don't see how it's relevant to what I said.
No, I am making the argument that models have poor capabilities outside of tasks they are RLed for, and their capabilities inside those tasks are only as good as the capabilities of the people evaluating their responses, i.e. not great. Even if you instruct the model "don't do X" or "do X this way", you cannot rely on the model following that instruction. This means that there is nothing you can do if the model makes "errors."
Not necessarily relevant, but fun: I had the ChatGPT model correct itself mid-response when checking my math work. It started by saying that I was wrong, then proceeded to solve the problem, and at the end it realized that I was correct.