It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n (identical) layers.
The notion of a context window applies to the sequence and isn't really affected by this; each iteration sees and attends over the whole sequence.
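To make that concrete, here's a minimal PyTorch sketch of the weight-sharing idea (names and hyperparameters are illustrative, not the paper's actual setup, and it leaves out the per-step timestep embeddings and ACT halting the paper describes): one encoder layer is applied repeatedly, so depth comes from the iteration count rather than from a stack of distinct layers.

```python
import torch
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    """Sketch: one transformer layer reused for n_steps iterations.
    Running n_steps iterations costs the same compute as an n_steps-deep
    transformer, but the parameters are shared across depth."""

    def __init__(self, d_model=256, n_heads=4, n_steps=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.n_steps = n_steps

    def forward(self, x):
        # Each iteration attends over the whole sequence; the context
        # window is a property of the sequence, not of the recurrence.
        for _ in range(self.n_steps):
            x = self.layer(x)
        return x

x = torch.randn(2, 128, 256)        # (batch, seq_len, d_model)
out = SharedLayerTransformer()(x)   # same shape: (2, 128, 256)
```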
Thanks, this was helpful! Reading the seminal paper[0] on Universal Transformers also gave some insights:
> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.
Very interesting; it seems to be an “old” architecture that is only now being leveraged to promising effect. Curious what made it an active area again (with the work from Samsung and Sapient, and now this one). Perhaps diminishing returns on regular transformers?
0: https://arxiv.org/abs/1807.03819