Hacker News

seanmcdirmid · today at 6:47 AM

LLMs need memory bandwidth to stream lots of data through quickly, not so much caching — which is basically the same way a GPU uses memory.


Replies

zozbot234 · today at 11:21 AM

OTOH, LLM inference tends to have very predictable memory access patterns. So well-placed prefetch instructions, issuing those fetches in parallel with expensive compute, might help CPU performance quite a bit. I assume this is already done as part of optimized numerical primitives such as GEMM, since that's where most of the gain would be.