The attention mechanism is capable of computing the tokens the machine will predict. In my thought experiment, where you can magically pluck a weight-set from a trillion-dimensional space, only a tiny subset of that space would be dedicated to language. We have no capability of training such a system at this time, much like we have no way of training a non-differentiable architecture.
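To make the first claim concrete, here is a minimal sketch of standard scaled dot-product attention, with a "plucked" weight-set stood in by random projection matrices; the names, shapes, and random initialization are illustrative assumptions, not anything from the original argument:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                               # weighted mix of value vectors

# A "magically plucked" weight-set: here just random projections, which is the
# point -- almost all such weight-sets compute something, just not language.
rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

x = rng.standard_normal((4, d_model))                # 4 token embeddings
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)                                     # (4, 8): one output per token
```

Whatever weight-set you pluck, the mechanism deterministically mixes token representations; the thought experiment is that the overwhelming majority of points in that weight space yield mixings that have nothing to do with language.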