Attention is all you need…?
The short answer, as far as I’m aware, is that no one really knows. The longer answer is that we have a lot of partial answers that, to my mind, boil down to this: model architectures trace a walk through the high-dimensional vector space of concepts, and we’ve tuned them to land on the right answer. The fact that they do so consistently says something about how logic is encoded in language, and about the effectiveness of these embedding/latent spaces.
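To make the “walk” picture a bit more concrete, here’s a toy sketch (my own illustration, not taken from any real model, with made-up sizes and random matrices standing in for attention/MLP blocks): a transformer’s residual stream updates a hidden vector layer by layer, roughly x ← x + f(x), so each token’s representation traces a path through the latent space, and training amounts to shaping those updates so the path ends up near the right answer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16   # illustrative embedding width
n_layers = 6   # illustrative depth

# Toy "layers": random linear maps standing in for attention/MLP blocks.
layers = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

# A token starts as a point in embedding space...
x = rng.normal(size=d_model)
trajectory = [x.copy()]

# ...and each residual block nudges it: x <- x + f(x).
for W in layers:
    x = x + np.tanh(W @ x)
    trajectory.append(x.copy())

# The "walk": how far the representation moves at each layer.
steps = [np.linalg.norm(b - a) for a, b in zip(trajectory, trajectory[1:])]
print([round(s, 3) for s in steps])
```

None of this explains *why* the walk lands where it does, of course; it just shows the sense in which the architecture is a sequence of small moves through a vector space rather than anything that looks like explicit symbolic reasoning.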