Isn’t this in a sense an RNN built out of a slice of an LLM? If so, it might have the same drawbacks, namely slowness to train, but also benefits such as an endless context window (in theory).
It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n layers.
The notion of a context window applies to the sequence, and the recurrence doesn't really change that: each iteration sees and attends over the whole sequence.
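To make the "transformer with shared layer weights" point concrete, here is a minimal sketch in PyTorch. It is not the actual architecture being discussed; the class name, dimensions, and step count are all my own assumptions. It just shows one weight-tied layer looped n times, with full self-attention over the sequence at every step:

    # Hypothetical sketch: one transformer layer reused for n_steps
    # iterations (shared weights). Compute for n_steps iterations
    # matches an n_steps-layer transformer.
    import torch
    import torch.nn as nn

    class LoopedTransformer(nn.Module):
        def __init__(self, d_model=512, nhead=8, n_steps=12):
            super().__init__()
            # A single layer, applied repeatedly: this is the weight sharing.
            self.layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=nhead, batch_first=True)
            self.n_steps = n_steps

        def forward(self, x):  # x: (batch, seq_len, d_model)
            for _ in range(self.n_steps):
                # Each iteration attends over the whole sequence,
                # so the context window is set by seq_len, not by n_steps.
                x = self.layer(x)
            return x

    tokens = torch.randn(2, 128, 512)   # (batch, seq_len, d_model)
    out = LoopedTransformer()(tokens)   # same shape as the input

Unlike a classic RNN, the state carried between iterations is the full sequence of hidden vectors rather than a fixed-size vector, which is why iterating more steps doesn't extend the context window.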