I can't see anything about "training a transformer". I'm trying to understand if e.g. the Sudoku solver was learned from examples (in which case, what examples?) or whether it was manually coded and then "compiled" into weights.
I would assume it was manually coded.
I assumed that they had to train it; how else would they get it "inside" a transformer?
I also get a bit of a bad smell from the article. It sounds revolutionary but gives no details or clear explanation.