> I think there is a strong case to make that many of their limitations come from them doing what we have told them to do. Hallucinations are the standout example of this. If you train a model to give answers to questions, it will answer questions, but it might have to make up the answer to do so. This isn't a matter of not knowing that it does not know; it is doing the task it was given regardless of whether it knows or not.
Are you asserting that an LLM could be NOT trained to answer when it knows it doesn't know the answer, or, if that's not possible, be trained to NOT answer when it knows it doesn't know the answer?
If so, I would believe your thinking, but for some reason I have not yet seen a single LLM that behaves with that kind of self-knowledge.
Of course it's possible.
I don't say this because I know how, but because I see no reason why we will be unable to crack that problem. If our brains can do it, AI will be able to one day.
A model should be trained to answer when it knows the answer, and to state that it does not know when it does not. Models might already have a very good internal representation of not knowing; they are just not trained to express it.
This is not a limitation of the system's abilities; it is a problem of how to construct training for such a task.
You need to provide training examples where the model says it does not know only when it actually does not know: examples where it answers "I don't know" when it does not contain that knowledge, but provides an answer when it does.
To create such examples, you need to know in advance what the model knows and what it does not know. You can't just have a database of facts that it knows, because you also need to account for things it can readily infer.
Any model that can reliably give the sum of two 10-digit integers should be treated as knowing every one of those sums, yet you can't list every pair of numbers it knows how to add. And that is just the tiniest subset of the task, because you would have to determine every inferable fact, not just arithmetic. Adding to the problem, training on questions like this can itself add to the model's knowledge, either from the question itself or because the model inductively works out the answer from the combination of the question and the fact that it was not expected to know it.
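Expressed as pseudocode, the dependency you would need in the training data looks something like this. It is only a minimal sketch with hypothetical names: `model_knows` is the oracle, and the whole point above is that nobody has a good way to implement it.

```python
# Minimal sketch (hypothetical names): the training target depends on the
# model being trained, not just on the question.
IDK = "I don't know."

def model_knows(model, question: str) -> bool:
    # Hypothetical oracle. A static fact database is not enough, because it
    # would also have to cover everything the model can readily infer.
    raise NotImplementedError

def build_example(model, question: str, reference_answer: str) -> dict:
    """Same question, different target, depending on what this model knows."""
    target = reference_answer if model_knows(model, question) else IDK
    return {"prompt": question, "target": target}
```

The same question and reference answer would produce different targets for different models, which is why a fixed dataset can't do the job.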
A completely different training system would have to be implemented. There is research on categorising patterns of activations that can determine a form of 'mental state' of a model. Through that mechanism, you could build a dynamic training approach where the answer the model is expected to give, and is rewarded for giving, depends in part on the model's own internal state.
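Roughly, the idea would be to train a probe on the model's activations to predict whether it will answer correctly, then let the probe choose the target that gets rewarded. Here is a rough sketch with my own assumptions baked in (gpt2 as a stand-in model, the last-token hidden state as the feature, a logistic-regression probe), not an established recipe:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

IDK = "I don't know."

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def activation(question: str) -> np.ndarray:
    """Hidden state of the final token at the last layer, used as a feature vector."""
    ids = tok(question, return_tensors="pt")
    with torch.no_grad():
        out = lm(**ids)
    return out.hidden_states[-1][0, -1].numpy()

def train_probe(questions, answered_correctly):
    """Linear probe: do the activations predict whether the model answers correctly?
    The `answered_correctly` labels would come from grading the model's answers offline."""
    X = np.stack([activation(q) for q in questions])
    y = np.array(answered_correctly)
    return LogisticRegression(max_iter=1000).fit(X, y)

def dynamic_target(probe, question: str, reference_answer: str, threshold=0.5) -> str:
    """Pick the training target from the model's own state: reward the factual
    answer when the probe says 'knows', reward an explicit refusal otherwise."""
    p_knows = probe.predict_proba(activation(question)[None, :])[0, 1]
    return reference_answer if p_knows >= threshold else IDK
```

The threshold is where you would trade hallucinations against unnecessary refusals; the sketch just leaves it at 0.5.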