
Const-me (10/11/2024)

Note that not all problems are compute bound. Many practical problems bottleneck on memory bandwidth.

For example, LLM inference on a desktop (where you don't have a dozen concurrent sessions from multiple users) is all but guaranteed to be memory bound: every generated token requires fetching gigabytes of model tensors from memory. For use cases like that, specialized tensor cores deliver about the same performance as well-written compute shaders running on general-purpose GPU cores.
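As a back-of-envelope sketch (my numbers, not measurements): a memory-bound generator can't produce tokens faster than memory bandwidth divided by the bytes of weights it must stream per token. Assuming a hypothetical 7B-parameter FP16 model, that's roughly 14 GB per token:

    // Minimal sketch: upper bound on token rate for memory-bound inference.
    // Assumptions (hypothetical): 7B parameters, FP16 weights, and every
    // weight streamed from memory exactly once per generated token.
    #include <cstdio>

    int main() {
        const double model_bytes = 7e9 * 2.0;  // ~14 GB of FP16 weights
        const double bandwidth = 100e9;        // bytes/sec; substitute your machine's figure
        printf("upper bound: %.1f tokens/sec\n", bandwidth / model_bytes);
        return 0;
    }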

However, AVX-512 on the CPU is still way slower than a GPU for this, because modern GPUs have much higher memory bandwidth. In my desktop computer, the system memory is dual-channel DDR5 delivering 75 GB/s, while the VRAM on the discrete GPU delivers 670 GB/s.
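Plugging those figures into the sketch above, a hypothetical 14 GB model tops out around 5 tokens/sec from 75 GB/s system RAM versus roughly 48 tokens/sec from 670 GB/s VRAM. If you want to sanity-check the system-RAM figure on your own machine, a crude single-threaded memcpy loop gets in the ballpark (it counts both the read and the write, and may undercount the dual-channel peak):

    // Crude sketch of measuring sustained DRAM bandwidth via memcpy.
    // The buffer is far larger than any CPU cache, so traffic hits DRAM;
    // single-threaded, so it may not saturate dual-channel DDR5.
    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
        const size_t bytes = 1ull << 30;  // 1 GiB per copy
        std::vector<char> src(bytes, 1), dst(bytes);

        const int iters = 10;
        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; i++)
            std::memcpy(dst.data(), src.data(), bytes);
        const auto t1 = std::chrono::steady_clock::now();

        const double secs = std::chrono::duration<double>(t1 - t0).count();
        // Each memcpy reads and writes the buffer, so count 2x per iteration.
        printf("~%.1f GB/s effective\n", 2.0 * bytes * iters / secs / 1e9);
        return 0;
    }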