> That's what the cache hierarchies are for
That’s the core point, though. With a batch, the cache and registers are already primed with the weights by the time the next input needs them. The model runs layer by layer, reading a different set of weights from VRAM at each step, and batching lets every input in the batch share each of those reads.
I agree that RAM-to-VRAM transfer matters too, but I think the key speedup from batching at inference time is the point above.
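To put rough numbers on the weight-reuse argument, here is a back-of-the-envelope sketch. It assumes a single dense layer whose weights are read from VRAM once per batch, with fp16 weights (2 bytes each), and it ignores activation and KV-cache traffic entirely. The function name `arithmetic_intensity` and the 4096x4096 layer size are made up for illustration, not taken from any particular model or library.

```python
def arithmetic_intensity(batch_size, d_in, d_out, bytes_per_weight=2):
    """FLOPs performed per byte of weight traffic for one dense layer.

    Assumption: the weight matrix is read from VRAM once per batch,
    no matter how many inputs are in the batch.
    """
    # One multiply and one add per weight, for every input in the batch.
    flops = 2 * batch_size * d_in * d_out
    # Weight bytes moved once for the whole batch.
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes


if __name__ == "__main__":
    for batch_size in (1, 8, 64):
        ratio = arithmetic_intensity(batch_size, 4096, 4096)
        print(f"batch={batch_size:>2}: {ratio:.1f} FLOPs per weight byte")
```

Under these assumptions the ratio grows linearly with batch size: the same weight bytes are moved, but more useful work is done on them. Whether that reuse happens in registers, cache, or shared memory depends on the kernel, which this sketch does not model.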
Not really. Registers are irrelevant. They are not the bottleneck.