The whole point of RLHF is to make up for the fact that there is no loss function you can write down, in terms of token IDs or their order, that captures what makes an answer good. A good answer can come in many different shapes and forms.
That’s why all those models fine-tuned on (instruction, input, answer) tuples are essentially lobotomized. They’ve been told that, for a given input, only the one answer in the training data is correct, and any deviation should be “punished”.
In truth, for each given input, there are many examples of output that should be reinforced, many examples of output that should be punished, and a lot in between.
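To make that concrete, here is a minimal sketch in PyTorch of the two objectives side by side. `TinyLM` and `reward_fn` are toy stand-ins invented for illustration (not a real LLM or a learned reward model), and the RL update is plain REINFORCE rather than the full PPO machinery used in practice. The point is only the shape of the two losses: cross-entropy compares against a single reference answer token by token, while the reward-based update scores whole sampled answers and can push many different outputs up or down.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden = 32, 64

class TinyLM(torch.nn.Module):
    """Toy autoregressive 'LM' standing in for a real pretrained model."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.rnn = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, time)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                           # logits: (batch, time, vocab)

policy = TinyLM()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
prompt = torch.randint(0, vocab_size, (1, 5))

# 1) SFT on an (instruction, input, answer) tuple: cross-entropy treats the one
#    reference answer as the only correct continuation, token by token.
gold_answer = torch.randint(0, vocab_size, (1, 8))
inp = torch.cat([prompt, gold_answer[:, :-1]], dim=1)
logits = policy(inp)[:, prompt.size(1) - 1 :, :]      # positions that predict the answer
sft_loss = F.cross_entropy(logits.reshape(-1, vocab_size), gold_answer.reshape(-1))
opt.zero_grad(); sft_loss.backward(); opt.step()

# 2) RLHF-style: sample several whole answers, score each with a scalar
#    judgment, and push good ones up and bad ones down in proportion.
def reward_fn(answer_tokens):
    # Hypothetical stand-in for a learned reward model (or a human rating).
    return (answer_tokens.float().mean(dim=1) / vocab_size) * 2 - 1

def sample_answer(prompt, length=8):
    tokens, log_probs = prompt, []
    for _ in range(length):
        dist = torch.distributions.Categorical(logits=policy(tokens)[:, -1, :])
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.unsqueeze(1)], dim=1)
    return tokens[:, prompt.size(1):], torch.stack(log_probs, dim=1).sum(dim=1)

samples = [sample_answer(prompt) for _ in range(4)]
log_probs = torch.stack([lp for _, lp in samples]).squeeze(1)
rewards = torch.cat([reward_fn(ans) for ans, _ in samples])
advantages = rewards - rewards.mean()                 # better than average goes up, worse goes down
pg_loss = -(advantages.detach() * log_probs).mean()   # plain REINFORCE, no PPO clipping or KL term
opt.zero_grad(); pg_loss.backward(); opt.step()
```

Nothing in the second objective says there is exactly one right answer: every sampled output gets its own scalar judgment, which is exactly the room a good answer needs.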
When B. F. Skinner trained his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For example, instead of waiting for the pigeon to peck the key directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the key. Over time, he’d raise the bar, until eventually only clear pecks at the key received reinforcement. He called this technique “shaping”.
We should be doing the same when taming LLMs: shaping them from the document completers that pretraining produces into assistants.
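Here is a rough sketch of what such a shaping curriculum could look like. Everything in it is made up for illustration (the scorer, the prompts, the threshold schedule): reinforce any sampled answer that clears the current bar, and keep raising the bar over successive rounds.

```python
import random

def reward(answer: str) -> float:
    # Hypothetical scorer: in practice a reward model or a human preference judgment.
    return random.random()

def sample_answers(prompt: str, n: int = 8) -> list[str]:
    # Stand-in for sampling n completions from the current policy.
    return [f"candidate {i} for: {prompt}" for i in range(n)]

# Early rounds reinforce anything vaguely in the right direction (a low bar);
# later rounds only reinforce clearly good answers, the equivalent of waiting
# for the actual peck.
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
prompts = ["Summarize this document.", "Write a polite refusal."]

for bar in thresholds:
    to_reinforce = []                       # answers that clear the current bar
    for prompt in prompts:
        for answer in sample_answers(prompt):
            score = reward(answer)
            if score >= bar:
                to_reinforce.append((prompt, answer, score - bar))
    # An RL update (e.g. the policy-gradient step sketched earlier) would now
    # push the probability of these answers up, weighted by their margin.
    print(f"bar={bar:.1f}: reinforcing {len(to_reinforce)} of {len(prompts) * 8} samples")
```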