logoalt Hacker News

pests06/25/20251 replyview on HN

> That's what the cache hierarchies are for

That’s the core point though. If you do batches the cache and registers are already primed and ready. The model runs in steps/layers accessing different weights in VRAM along the way. When batching you take advantage of this.

I’m in agreement that RAM to VRAM is important too but I feel the key speed up for inference batching is my above point.


Replies

menaerus06/25/2025

Not really. Registers are irrelevant. They are not the bottleneck.

show 1 reply