Hacker News

wolfgangK · last Saturday at 9:40 PM

For LLM inference, I don't think PCIe bandwidth matters much, and a GPU could greatly improve prompt processing speed.
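
A rough sketch of why that's plausible (all figures below are illustrative assumptions, not measurements): prompt processing runs the whole prompt as one batch, so it's compute-bound, and a GPU's FLOPS advantage shows up directly:

    # Illustrative prefill estimate; model size and throughput
    # figures are assumptions, not benchmarks.
    model_params = 8e9                   # hypothetical 8B-parameter model
    flops_per_token = 2 * model_params   # ~2 FLOPs per weight per token
    prompt_tokens = 2048

    for name, flops in [("a CPU at ~1 TFLOPS", 1e12),
                        ("a GPU at ~30 TFLOPS", 30e12)]:
        seconds = prompt_tokens * flops_per_token / flops
        print(f"prefill of {prompt_tokens} tokens on {name}: ~{seconds:.0f} s")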


Replies

zozbot234 · last Sunday at 10:40 AM

The Strix Halo iGPU is quite special, like the Apple iGPU it has such good memory bandwidth to system RAM that it manages to improve both prompt processing and token generation compared to pure CPU inference. You really can't say that about the average iGPU or low-end dGPU: usually their memory bandwidth is way too anemic, hence the CPU wins when it comes to emitting tokens.
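
A quick way to see why bandwidth decides token generation: each generated token has to read roughly all the weights once, so the theoretical ceiling is memory bandwidth divided by model size. A sketch with assumed figures (the Strix Halo number is its rough 256-bit LPDDR5X spec; the model size is a hypothetical example):

    # Decode is memory-bandwidth-bound: every token reads all weights once.
    # Bandwidth figures are approximate; the model size is assumed.
    weights_bytes = 8e9  # hypothetical 8B model at ~8-bit quantization

    for name, bw in [("dual-channel DDR5 CPU, ~80 GB/s", 80e9),
                     ("Strix Halo LPDDR5X, ~256 GB/s", 256e9)]:
        print(f"{name}: decode ceiling ~{bw / weights_bytes:.0f} tok/s")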

ElectricalUnion · last Saturday at 9:44 PM

Only if your entire model fits in the GPU's VRAM.

To me this reads like "if you can afford those 256GB VRAM GPUs, you don't need PCIe bandwidth!"
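
For scale: if the weights don't fit in VRAM and have to be streamed over the bus for every token, PCIe itself becomes the ceiling. A sketch with nominal link rates and an assumed model size:

    # If weights stream over PCIe each token, the bus caps decode speed.
    # Link rates are nominal; the model size is a hypothetical example.
    weights_bytes = 8e9  # hypothetical 8B model at ~8-bit quantization

    for name, bw in [("PCIe 3.0 x4, ~4 GB/s", 4e9),
                     ("PCIe 4.0 x16, ~32 GB/s", 32e9),
                     ("PCIe 5.0 x16, ~64 GB/s", 64e9)]:
        print(f"{name}: ~{bw / weights_bytes:.1f} tok/s ceiling")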

jgalt212 · last Saturday at 9:45 PM

Yeah, I think so. Once the whole model is on the GPU (at the cost of a potentially slower start-up), there really isn't much traffic between the GPU and the rest of the system. That's how I think about it, but I'm mostly saying this because I'm interested in being corrected if I'm wrong.
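
That intuition holds up in rough numbers: once the weights are resident, the steady-state traffic per token is tiny (token ids in, logits or a sampled token out). A sketch with assumed sizes:

    # Per-token host<->GPU traffic once weights are resident in VRAM.
    # Vocab size and link rate are assumptions for illustration.
    vocab_size = 128_000           # hypothetical vocabulary
    logits_bytes = vocab_size * 4  # fp32 logits copied back, ~512 KB
    pcie_bw = 32e9                 # PCIe 4.0 x16, ~32 GB/s nominal

    print(f"~{logits_bytes / 1e3:.0f} KB/token, "
          f"~{logits_bytes / pcie_bw * 1e6:.0f} us over PCIe 4.0 x16")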