You still need to do a forward pass per token, and each token in a sequence depends on the one before it. With massive batching and full pipelining you might be able to hide those dependencies across independent sequences and output one token per cycle, but clearly they aren't doing that.
More aggressive pipelining will probably be the next step.
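To make the batching-plus-pipelining point concrete, here is a toy cycle-count model. It is only a sketch under my own assumptions: the forward pass is split into `num_stages` equal stages that each take one cycle, a sequence cannot start its next token until the previous one leaves the last stage, and the name `simulate_decode` is made up for illustration. It says nothing about how any real chip is built.

```python
from collections import deque


def simulate_decode(num_stages, batch, tokens_per_seq):
    """Return the cycles needed to decode every token in a toy pipeline.

    Assumptions (hypothetical model): each stage takes one cycle, holds one
    token at a time, and a sequence's next token may enter stage 0 only
    after its previous token has left the last stage.
    """
    pipeline = [None] * num_stages        # sequence id occupying each stage
    remaining = [tokens_per_seq] * batch  # tokens not yet started, per sequence
    ready = deque(range(batch))           # sequences allowed to start a token
    total = batch * tokens_per_seq
    done = 0
    cycles = 0
    while done < total:
        cycles += 1
        # Start a new token in stage 0 if some sequence is ready.
        if ready:
            seq = ready.popleft()
            pipeline[0] = seq
            remaining[seq] -= 1
        # The token in the last stage completes at the end of this cycle.
        finished = pipeline[-1]
        if finished is not None:
            done += 1
            if remaining[finished] > 0:
                ready.append(finished)  # its next token can start next cycle
        # Advance every in-flight token by one stage.
        pipeline = [None] + pipeline[:-1]
    return cycles


if __name__ == "__main__":
    for batch in (1, 2, 4):
        cycles = simulate_decode(num_stages=4, batch=batch, tokens_per_seq=3)
        print(batch, cycles, (batch * 3) / cycles)
```

In this model a single sequence on a 4-stage pipeline gets one token every 4 cycles, because the pipeline sits mostly empty while waiting on the dependency. Once the batch is at least as large as the pipeline depth, the stages stay full and throughput approaches one token per cycle, with only the initial fill as overhead.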