> Perhaps you don't even need access to the state of anything except the penultimate hidden layer, since based on my vague reading of [3] the residual stream doesn't "lose information" as it passes deeper down the attention layers, so each block maybe manipulates a different subspace of the residual stream.
I imagine that conventional transformers kind of force this. If you train a transformer on tasks like “Repeat the following words: apple banana cat”, then the model is more or less forced to carry the input forward through its layers far enough to perform the task at the end. But maybe if you pre-trained from scratch with an architecture where later layers get direct access to earlier layers and/or the raw input, the model wouldn’t need to propagate that information through the residual stream.
Or maybe it would all fall apart and something would go wrong with the gradients.
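To make the idea concrete, here’s a rough sketch of one way “later layers get direct access to the raw input” could look: each block cross-attends to the raw token embeddings alongside its usual self-attention, so nothing would have to be carried forward through the residual stream just to stay available. The names (`InputSkipBlock`, `InputSkipTransformer`), the choice of cross-attention over (say) concatenating the embeddings, and all the hyperparameters are my own illustrative assumptions, not something from [3] or any existing architecture.

```python
import torch
import torch.nn as nn


class InputSkipBlock(nn.Module):
    """One decoder block with an extra attention path over the raw embeddings."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical extra path: attend directly to the raw token embeddings,
        # bypassing whatever the earlier blocks wrote to the residual stream.
        self.input_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, raw_emb, causal_mask):
        h = self.ln1(x)
        x = x + self.self_attn(h, h, h, attn_mask=causal_mask)[0]  # usual self-attention
        h = self.ln2(x)
        x = x + self.input_attn(h, raw_emb, raw_emb, attn_mask=causal_mask)[0]  # raw-input access
        x = x + self.mlp(self.ln3(x))
        return x


class InputSkipTransformer(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(
            InputSkipBlock(d_model, n_heads) for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        emb = self.embed(tokens)  # positional encodings omitted for brevity
        seq_len = tokens.size(1)
        # Additive causal mask: 0 on/below the diagonal, -inf above it.
        mask = torch.full((seq_len, seq_len), float("-inf")).triu(1)
        x = emb
        for block in self.blocks:
            x = block(x, emb, mask)  # every block sees the raw embeddings
        return self.head(x)


logits = InputSkipTransformer()(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])
```

Whether training something like this actually removes the pressure to propagate the input forward, or just runs into the gradient problems worried about above, seems like an empirical question.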