Hacker News

talldayo · 10/11/2024

> GPUs (CUDA) are orthogonal to consumer processors (ARM / X86).

We're talking about vector operations. CUDA is not a GPU but a library of hardware-accelerated functions, not fundamentally different from OpenCL or even NEON on ARM. You can reimplement everything CUDA does on a CPU, and on a modern CPU you can vectorize it too. x86 handles this well because it still has dedicated SIMD logic that keeps pace with the throughput an integrated GPU might offer. ARM leaves those wide operations out entirely (which is smart for efficiency), and therefore relies either on someone porting CUDA code to an ARM GPU shader (fat chance) or on offloading to a remote GPU. That's why ARM is excellent at sustained simple ops but collapses when you benchmark it brute-forcing AI or translating AVX to NEON. That kind of SIMD is too much for a base-spec ARM core.
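
To make the AVX-vs-NEON point concrete, here's a minimal sketch of the same saxpy-style kernel (y = a*x + y) written once with x86 AVX intrinsics and once with ARM NEON intrinsics. The function and variable names are mine, not from llama.cpp or any CUDA library; it's just to show what the translation looks like:

    #include <stddef.h>

    /* Hypothetical saxpy kernel: y = a*x + y, vectorized two ways. */
    #if defined(__AVX__)
    #include <immintrin.h>
    static void saxpy(float a, const float *x, float *y, size_t n) {
        size_t i = 0;
        __m256 va = _mm256_set1_ps(a);                  /* broadcast a into 8 lanes */
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);
            __m256 vy = _mm256_loadu_ps(y + i);
            _mm256_storeu_ps(y + i, _mm256_add_ps(_mm256_mul_ps(va, vx), vy));
        }
        for (; i < n; i++) y[i] += a * x[i];            /* scalar tail */
    }
    #elif defined(__ARM_NEON)
    #include <arm_neon.h>
    static void saxpy(float a, const float *x, float *y, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {                    /* only 4 lanes per register */
            float32x4_t vx = vld1q_f32(x + i);
            float32x4_t vy = vld1q_f32(y + i);
            vst1q_f32(y + i, vmlaq_n_f32(vy, vx, a));   /* vy + vx * a */
        }
        for (; i < n; i++) y[i] += a * x[i];            /* scalar tail */
    }
    #endif

The port is nearly line-for-line, which is roughly what "translating AVX to NEON" amounts to; the difference is that the x86 loop retires 8 floats per instruction to NEON's 4, and AVX-512 doubles that again.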

> Maybe we could assume a platonic ideal merged chip, a CPU that acts like a GPU, but there's more differences between those two things than an instruction set for vector ops.

Xeon Phi or Itanium flavored?


Replies

refulgentis · 10/11/2024

I've read this 10x and get more out of it each time.

I certainly don't grok it yet, so I might be wrong when I say it's still crystal clear there's a little motte/bailey going on with "blame ARM for CUDA" vs. "ARM is shitty at SIMD vs. X86".

That aside, I'm building something that relies on llama.cpp for inference on every platform.

In this scenario, Android is de facto "ARM" to me.

The Vulkan backend either doesn't support Android, or it does and the 1-2 people who got it running see absurdly worse performance (something something shaders, as far as I understand it).

iOS is de facto "not ARM" to me because it runs on the GPU.

I think llama.cpp isn't a great scenario for me to learn this at the level you understand it, since it's tied to running a very particular kind of thing.

That aside, it was remarkable to me that my 13th-gen Intel i5 Framework laptop gets 2 tokens/sec on both the iGPU and the CPU. And IIUC, your comment explains that, in that "x86...[has] dedicated logic that keeps pace with SIMD...on [an integrated GPU]".

That aside, my Pixel Fold (read: 2022 mid-range Android CPU, which should certainly be slower than a 2023 mid-upper-range Intel part) kicks it around the block: 7 tokens/sec on CPU, 14 tokens/sec with the NEON layout.

Now, that aside, SVE was shown to double that again, indicating there's significant headroom beyond NEON (https://github.com/ggerganov/llama.cpp/pull/9290). (I have ~0 idea what this is other than 'moar SIMD for ARM'; for all I know, it's Amazon Graviton-specific.)
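
For what it's worth, the thing that distinguishes SVE from NEON is that it's vector-length agnostic: the same binary uses however many lanes the hardware provides (128-bit on most mobile cores, 256-bit on parts like Graviton 3), with predicates handling the loop tail. A minimal sketch using the ACLE SVE intrinsics, assumed from the ACLE spec rather than taken from that PR:

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical vector-length-agnostic saxpy: y = a*x + y.
       svcntw() = how many 32-bit lanes this hardware's vectors hold,
       so the same loop processes 4 floats per iteration on a 128-bit
       implementation and 8 on a 256-bit one. */
    static void saxpy_sve(float a, const float *x, float *y, size_t n) {
        for (size_t i = 0; i < n; i += svcntw()) {
            svbool_t pg = svwhilelt_b32((uint64_t)i, (uint64_t)n); /* mask off the tail */
            svfloat32_t vx = svld1_f32(pg, x + i);
            svfloat32_t vy = svld1_f32(pg, y + i);
            vy = svmla_n_f32_x(pg, vy, vx, a);                     /* vy + vx * a */
            svst1_f32(pg, y + i, vy);
        }
    }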