A Transformer is a more powerful model than a Markov chain, but on a machine as weak as the C64, a Markov chain could output text faster. It would surely sound "psychedelic", though: memory limits a Markov chain on the C64 to a first- or second-order model, so predicting the next word takes only the one or two preceding words into account as context (and there is no attention).
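To make the order limitation concrete, here is a minimal sketch of a second-order Markov chain text generator (illustrative Python, obviously not C64 code; the tiny corpus and seed pair are made up for the example):

```python
import random
from collections import defaultdict

def build_model(words):
    # Map each pair of consecutive words to the words observed after it.
    model = defaultdict(list)
    for i in range(len(words) - 2):
        model[(words[i], words[i + 1])].append(words[i + 2])
    return model

def generate(model, seed, length=20):
    # Predict each next word from only the two preceding words --
    # this is the entire "context" a second-order chain ever sees.
    w1, w2 = seed
    out = [w1, w2]
    for _ in range(length):
        candidates = model.get((w1, w2))
        if not candidates:
            break
        w1, w2 = w2, random.choice(candidates)
        out.append(w2)
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran off the mat".split()
model = build_model(corpus)
print(generate(model, ("the", "cat")))
```

Because the state is just a word pair, the table stays small enough for 64 KB, but any dependency longer than two words is simply lost, which is where the "psychedelic" output comes from.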
On a plain vanilla C64, a Transformer cannot really show what it is capable of. An implementation using 2 bits per weight (vectorized) might do slightly better.
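The point of 2-bit weights is memory density: four weights fit in one byte. A rough sketch of the pack/unpack step (plain Python for illustration; the mapping of the four codes to actual weight values is an assumption, and a real C64 version would do this in 6502 assembly):

```python
def pack_weights(quantized):
    # quantized: list of ints in 0..3 (2-bit codes); pad to a multiple of 4.
    padded = quantized + [0] * (-len(quantized) % 4)
    packed = bytearray()
    for i in range(0, len(padded), 4):
        # Four 2-bit codes share one byte, lowest bits first.
        b = (padded[i]
             | (padded[i + 1] << 2)
             | (padded[i + 2] << 4)
             | (padded[i + 3] << 6))
        packed.append(b)
    return bytes(packed)

def unpack_weights(packed, count):
    # Recover the original 2-bit codes, discarding any padding.
    out = []
    for b in packed:
        for shift in (0, 2, 4, 6):
            out.append((b >> shift) & 0b11)
    return out[:count]

weights = [3, 0, 2, 1, 1, 2]
packed = pack_weights(weights)
assert unpack_weights(packed, len(weights)) == weights
```

At 2 bits per weight, 64 KB of RAM holds at most ~256K weights (before code, activations, and the OS take their share), which gives a sense of how tiny such a Transformer would have to be.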