There used to be a word for this in generative AI: mode collapse. It's not that the model doesn't generate human-like responses, it's that it generates the same 0.0001% of possible human like responses every time. It's almost certainly the instruction tuning which is responsible, maybe some small part of blame could go to the rollout policy (I have no idea how rollout policy works these days).