Hacker News

Basic Facts about GPUs

286 points by ibobev yesterday at 12:15 PM | 62 comments

Comments

b0a04gl yesterday at 2:06 PM

been running llama.cpp and vllm on the same 4070, trying to batch more prompts for serving. llama.cpp was lagging badly once I hit batch 8 or so, even though GPU usage looked fine. vllm handled it way better.

later found vllm uses a paged kv cache with a layout that matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that’s fine for a single prompt but breaks L2 access patterns when batching.

reshaped the kv tensors in llama.cpp to interleave: made it [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into its fused attention kernel. 2x speedup right there for the same ops.
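
roughly the idea as a minimal sketch (assuming the cache lives in a torch tensor; the names are made up for illustration, not llama.cpp internals):

    import torch

    seq_len, n_heads, head_dim = 4096, 32, 128

    # toy kv cache laid out [seq, head, dim]: fine for a single prompt,
    # but reading one head across the sequence hits a stride of n_heads * head_dim
    kv_flat = torch.randn(seq_len, n_heads, head_dim)

    # interleave to [head, seq, dim]; .contiguous() actually reorders memory
    # so each head's (seq, dim) block sits contiguously
    kv_interleaved = kv_flat.permute(1, 0, 2).contiguous()

    print(kv_interleaved.shape)  # torch.Size([32, 4096, 128])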

GPU was never the bottleneck. it was the memory layout not aligning with the SM’s expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that’s the real reason it scales better per batch.

this took its own time, say 2+ days, and I had to dig under the nice-looking GPU graphs to find the real bottlenecks. it was wildly trial and error tbf.

> anybody got an idea on how to do this kind of experiment in hot-reload mode without so much hassle??

elashri yesterday at 2:52 PM

Good article summarizing a good chunk of information that people should have some idea about. I just want to comment that the title is a little bit misleading, because this is talking about the particular choices that NVIDIA follows in developing their GPU archs, which is not always what others do.

For example, the arithmetic intensity break-even point (ridge point) is very different once you leave NVIDIA-land. Take the AMD Instinct MI300: up to 160 TFLOPS FP32 paired with ~6 TB/s of HBM3/3E bandwidth gives a ridge point near 27 FLOPs/byte, about double the A100’s 13 FLOPs/byte. The larger on-package HBM (128–256 GB) also shifts the practical trade-offs between tiling depth and occupancy. Although it is very expensive and does not have CUDA (which can be good and bad at the same time).
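
The ridge-point arithmetic, as a quick sketch (MI300 numbers as above; the A100 figures of 19.5 TFLOPS FP32 and ~1.55 TB/s are my assumption, and they roughly reproduce the 13 FLOPs/byte):

    # ridge point = peak compute (FLOP/s) / peak memory bandwidth (byte/s)
    def ridge_point(peak_tflops: float, peak_tb_per_s: float) -> float:
        return (peak_tflops * 1e12) / (peak_tb_per_s * 1e12)  # FLOPs per byte

    print(ridge_point(160.0, 6.0))    # MI300 FP32: ~26.7 FLOPs/byte
    print(ridge_point(19.5, 1.555))   # A100 FP32 (assumed specs): ~12.5 FLOPs/byte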

eapriv yesterday at 3:09 PM

Spoiler: it’s not about how GPUs work, it’s about how to use them for machine learning computations.

LarsDu88 yesterday at 7:40 PM

Maybe this should be titled "Basic Facts about Nvidia GPUs" as the WARP terminology is a feature of modern Nvidia GPUs.

Again, I emphasize "modern"

An NVIDIA GPU from circa 2003 is completely different and has baked-in circuitry specific to the rendering pipelines used for videogames at that time.

So most of this post is not quite general to all "GPUs", which are a much broader category of devices that don't necessarily encompass the kind of general-purpose computation we use modern Nvidia GPUs for.

bjornsing today at 5:20 AM

So how are we doing with whole-program optimization at the compiler level? It feels kind of backwards that people are optimizing these LLM architectures one at a time.

SoftTalker yesterday at 2:16 PM

Contrasting colors. Use them!

geoffbp today at 5:40 AM

“Arithmetic Intensity (AI)”

Hmm

kittikitti yesterday at 1:58 PM

This is a really good introduction and I appreciate it. When I was building my AI PC, the deep-dive research into GPUs took a few days, but this lays it all out in front of me. It's especially great because it touches on high-value applications like generative artificial intelligence. A notable diagram from the page that I wasn't able to find represented well elsewhere was the memory hierarchy of the A100 GPU. The diagrams were very helpful. Thank you for this!

neuroelectron yesterday at 7:09 PM

ASCII diagrams, really?