> The training doesn't evaluate "is the answer true" or "is the answer useful." It's either "is the answer likely to appear in the training corpus" or "is the RLHF judge happy with the answer." We are optimising LLMs to produce output which looks like high quality output.
It's not quite as dire as this. One of the main reasons LLMs are getting better over time is that they are themselves used to bootstrap the next generation, by sifting through the training set and doing 'various things' to it.
People often forget that the training corpus contains everything humanity has ever produced, and anything new humanity produces will likely come from it as well. Torturing it with current-generation models is among the most productive things you can do to improve next-generation systems.
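To make the "sifting" idea concrete, here is a minimal sketch of one thing it could mean: scoring every document and keeping only the ones above a threshold. This is an assumption about what 'various things' covers, not a description of any lab's actual pipeline. `toy_quality_score` is a hypothetical stand-in for a current-generation model's judgment (a real setup would presumably call a model there), and `filter_corpus` and the threshold are likewise made up for illustration.

```python
from typing import Callable, Iterable, List


def toy_quality_score(text: str) -> float:
    """Hypothetical stand-in for a model-based quality judgment.

    Returns the fraction of distinct words, so heavily repetitive
    text scores low. A real pipeline would likely query a model here.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def filter_corpus(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only the documents whose score meets the threshold."""
    return [doc for doc in docs if score(doc) >= threshold]


corpus = [
    "buy now buy now buy now buy now",
    "Gradient descent updates parameters in the direction that reduces the loss.",
    "",
]

# The filtered list is what would be passed on to train the next model.
kept = filter_corpus(corpus, toy_quality_score, threshold=0.5)
print(kept)
```

The scorer is passed in as a function so the toy heuristic could be swapped for a model call without touching the filtering loop. The same shape would work for other kinds of 'torturing', such as rewriting or labelling documents instead of dropping them.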