The different prompts in a batch do not mathematically affect each other. During inference you have massive weights that must be loaded and unloaded just to serve the current prompt and however long its context is (possibly only a few tokens). Batching lets you move the weights around less to serve the same amount of combined context.
Batching isn't about "moving weights around less". Where would you move the weights anyway once they are loaded into GPU VRAM? Batching, as always in CS problems, is about maximizing the compute done per round trip, and in this case the round trip is the DMA of context from CPU RAM to GPU VRAM.
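To put rough numbers on "compute per round trip" (a sketch with illustrative, hypothetical sizes, not measured figures): a single input row against a d×d weight matrix does about 2·d² FLOPs while the whole matrix must be moved once; a batch of B rows multiplies the FLOPs by B against the same weight traffic, so compute per byte moved scales with batch size.

```python
# Illustrative arithmetic-intensity estimate (hypothetical sizes, fp16 weights).
# A row-times-matrix product does ~2*d*d FLOPs (one multiply-add per weight).
def flops_per_weight_byte(batch, d, bytes_per_weight=2):
    flops = 2 * batch * d * d                 # multiply-adds across the whole batch
    weight_bytes = d * d * bytes_per_weight   # weights moved once, shared by the batch
    return flops / weight_bytes

print(flops_per_weight_byte(1, 4096))    # 1.0  -> batch of 1
print(flops_per_weight_byte(32, 4096))   # 32.0 -> 32x more compute per byte moved
```

The function names and sizes here are made up for illustration; the point is only that the ratio grows linearly with the batch.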
The premise of self-attention is exactly that it isn't context-free, so it is also incorrect to say that batched requests do not mathematically affect each other. They do, and that's by design.
If you add a dimension to the input vector you can do them independently and more efficiently. Look at this. Let's say you have a 2x2 network, and you apply it to an input vector of two values:
[i1 i2] · [w1 w2; w3 w4] = [i1·w1 + i2·w3   i1·w2 + i2·w4]
Cool. Now what happens if we make the input vector a 2x2 matrix with, for some reason, a second set of two input values:
[i1 i2; j1 j2] · [w1 w2; w3 w4] = [i1·w1 + i2·w3   i1·w2 + i2·w4; j1·w1 + j2·w3   j1·w2 + j2·w4]
Look at that! The input has 2 rows, each holding one set of input values for the network, and the output matrix has 2 rows, each containing the outputs for the respective inputs. So you can "just" apply your neural network to any number of inputs by putting one per row. You could do 2, or 1000, this way, and the weight matrix only needs to be loaded and traversed once for the whole batch.
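The 2×2 example above can be checked directly; a minimal sketch with numpy and made-up numbers (any shapes work the same way):

```python
import numpy as np

# Weight matrix: rows (w1 w2) and (w3 w4), values chosen arbitrarily.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])

i = np.array([5.0, 6.0])  # first set of input values
j = np.array([7.0, 8.0])  # second set of input values

# Stack the inputs as rows and multiply once.
batched = np.stack([i, j]) @ W

# Each output row matches the corresponding single-input product.
assert np.allclose(batched[0], i @ W)
assert np.allclose(batched[1], j @ W)
print(batched)  # [[23. 34.]
                #  [31. 46.]]
```

One matrix multiply produces both rows of outputs, which is exactly the "one input per row" trick described above.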