pests • 06/25/2025 • 1 reply

> That's what the cache hierarchies are for

That's the core point, though. When you batch, the caches and registers are already primed. The model runs layer by layer, reading different weights from VRAM at each step, and batching lets every request in the batch reuse the weights fetched for that step.

I agree that RAM-to-VRAM transfer matters too, but I think the point above is the key speedup from inference batching.
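To put rough numbers on that amortization, here is a back-of-the-envelope sketch. It assumes a single dense layer whose weight matrix is read from VRAM once per batch step, fp16 weights (2 bytes per parameter), and it ignores activation traffic entirely. The function names and the 4096 x 4096 layer size are made up for illustration, not taken from any real model.

```python
def weight_bytes_per_token(d_in, d_out, batch_size, bytes_per_param=2):
    """Weight bytes fetched from VRAM per token for one dense layer.

    Assumption: the full weight matrix is read once per batch step,
    so its cost is shared by every token in the batch.
    """
    weight_bytes = d_in * d_out * bytes_per_param
    return weight_bytes / batch_size


def flops_per_weight_byte(d_in, d_out, batch_size, bytes_per_param=2):
    """FLOPs performed per byte of weights read (a crude arithmetic intensity).

    A matrix multiply does 2 * d_in * d_out FLOPs per token
    (one multiply and one add per weight).
    """
    flops = 2 * batch_size * d_in * d_out
    weight_bytes = d_in * d_out * bytes_per_param
    return flops / weight_bytes


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 layer, fp16 weights.
    for batch in (1, 8, 64):
        per_token = weight_bytes_per_token(4096, 4096, batch)
        intensity = flops_per_weight_byte(4096, 4096, batch)
        print(f"batch={batch:3d}  weight bytes/token={per_token:12.0f}  "
              f"FLOPs/byte={intensity:5.1f}")
```

Under these assumptions the weight traffic per token falls linearly with batch size while the work per fetched byte rises by the same factor. Which level of the memory hierarchy that traffic actually hits depends on the hardware and the kernel.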

Replies

menaerus • 06/25/2025

Not really. Registers are irrelevant. They are not the bottleneck.
