Hacker News

xfalcox · 10/11/2024 · 1 reply

What about FP8? It is a very popular target for LLM inference.


Replies

adrian_b · 10/11/2024

AMD Zen 5 has the so-called "Vector Neural Network Instructions" (AVX-512 VNNI), which can be used for inference with INT8 quantization, as well as instructions for inference with BF16 quantization (AVX-512 BF16).
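For illustration only (not from the original comment), here is a minimal sketch of how an INT8 kernel typically uses these VNNI instructions through compiler intrinsics. The function name and the simplified loop are hypothetical, and it assumes a compiler and CPU with AVX-512 VNNI support (e.g. gcc -O2 -mavx512f -mavx512vnni):

    // Hypothetical example: INT8 dot product via the VPDPBUSD instruction
    // (_mm512_dpbusd_epi32), the core operation of INT8-quantized inference.
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // a is treated as unsigned and b as signed, matching VPDPBUSD semantics;
    // n must be a multiple of 64 for this simplified loop.
    int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n) {
        __m512i acc = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 64) {
            __m512i va = _mm512_loadu_si512(a + i);
            __m512i vb = _mm512_loadu_si512(b + i);
            // Multiply u8*s8 pairs and accumulate groups of four into 32-bit lanes.
            acc = _mm512_dpbusd_epi32(acc, va, vb);
        }
        // Horizontal sum of the sixteen 32-bit partial sums.
        return _mm512_reduce_add_epi32(acc);
    }

A real inference kernel would additionally apply per-channel scales to convert the INT32 accumulators back to floating point.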

FP8 is a more recent quantization format and AFAIK no CPU implements it.

I do not know what the throughput of these instructions is on Zen 5. It must be higher than on older CPUs, but it must be lower than on the Intel Xeon models that support AMX (which are much more expensive, so despite having higher absolute inference performance, they may have lower performance per dollar), and obviously lower than the tensor cores of a big NVIDIA GPU.

Nevertheless, for models that do not fit inside the memory of a GPU, inference on a Zen 5 CPU may become competitive.