Universal approximation theorem, embeddings, self-attention, gradient descent. And empirically, scaling laws.
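
Of the items above, gradient descent is the easiest to show in a few lines. Here is a minimal sketch, assuming a toy 1-D linear model fit by full-batch gradient descent on mean squared error. The function name `fit_line`, the data, the learning rate, and the step count are all illustrative assumptions, not taken from any particular library or system.

```python
# Minimal sketch: gradient descent on mean squared error for a 1-D linear model.
# The data, learning rate, and step count are illustrative assumptions.

def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by plain (full-batch) gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # toy data generated from y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b)
```

Training a large network follows the same loop in spirit: compute the gradient of a loss with respect to the parameters, then take a small step against it. Real training differs in that it typically uses mini-batches and automatic differentiation rather than hand-written gradients.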