It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.