I'm not so sure about that. The incorrect answers to just about any given problem are in the training data as well, but you can pretty reliably predict that the model will give the correct one, provided there's a strong enough statistical correlation in the training data. If your training data is sufficiently moral, the outputs will be as well.
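To make that concrete, here's a toy sketch (purely hypothetical numbers, and nothing like how a real LLM works internally): as long as the correct answer dominates statistically, a predictor that follows the empirical distribution returns it almost every time, even though the wrong answers never disappear from the data.

```python
from collections import Counter
import random

# Hypothetical tally of answers to "2 + 2 = ?" as seen in a training corpus.
# The wrong answers are in there too, but the correct one dominates.
observed_answers = ["4"] * 970 + ["5"] * 20 + ["22"] * 10
counts = Counter(observed_answers)

# Greedy decoding: always pick the most frequent continuation -> "4".
print("greedy:", counts.most_common(1)[0][0])

# Sampling from the empirical distribution: right ~97% of the time,
# but the incorrect answers are still occasionally emitted.
samples = random.choices(list(counts), weights=counts.values(), k=10)
print("sampled:", samples)
```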
> If your training data is sufficiently moral, the outputs will be as well.
Correction: if your training data and the input prompts are sufficiently moral. Given malicious queries, or the randomness introduced by sufficiently long chains of input and output, it's relatively easy to extract content from the model that its designers didn't want users to get.
In any case, the elephant in the room is that the models have not been trained on "sufficiently moral" content, whatever that means. Large Language Models need to be trained on humongous amounts of text, which means the builders have to draw on many different, very large corpora. It's impossible to filter all that diverse content to ensure that only 'moral content' is used; and even if it were possible, the resulting model would be far less useful for the general case, as it would have large gaps in its knowledge.
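As a rough illustration of the filtering problem (a hypothetical blocklist filter, not anything an actual lab uses): any rule crude enough to run over billions of documents throws away legitimate medical, historical, and legal text along with the content it was meant to catch, which is exactly where the knowledge gaps come from.

```python
# Hypothetical, deliberately naive "morality" filter; blocklist and corpus are made up.
BLOCKLIST = {"kill", "weapon", "drug"}

corpus = [
    "How to build a weapon at home",                        # the intended target: dropped
    "Antibiotics are drugs that kill bacteria",             # medical fact: dropped anyway
    "The treaty ended a war fought with chemical weapons",  # history: dropped anyway
    "A recipe for vegetable soup",                          # kept
]

def passes_filter(doc: str) -> bool:
    """Keep a document only if it mentions no blocklisted term."""
    text = doc.lower()
    return not any(term in text for term in BLOCKLIST)

kept = [doc for doc in corpus if passes_filter(doc)]
print(kept)  # only the soup recipe survives
```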