That proof only applies to fixed-architecture, feed-forward multilayer perceptrons with no recurrence, if I recall correctly. Transformers are not that: attention layers compute input-dependent weightings, and autoregressive decoding reintroduces a form of recurrence at inference time.