I've read this 10x and get more out of it each time.
I certainly don't grok it yet, so I might be wrong when I say it still seems crystal clear there's a little motte/bailey going on between "blame ARM for CUDA" and "ARM is shitty at SIMD vs. x86".
That aside, I'm building something that relies on llama.cpp for inference on every platform.
In this scenario, Android is de facto "ARM" to me.
The Vulkan backend either doesn't support Android, or it does and the 1-2 people who got it running see absurdly worse performance (something about shaders, as far as I understand it).
iOS is de facto "not ARM" to me because inference there runs on the GPU.
I don't think llama.cpp is a great vehicle for me to learn this at the level you understand it, since it's tied to running one very particular kind of workload.
That aside, it was remarkable to me that my 13th-gen Intel i5 Framework laptop gets 2 tokens/sec on both the iGPU and the CPU. And IIUC, your comment explains that, in that "x86...[has] dedicated logic that keeps pace with SIMD...on [an integrated GPU]"
That aside, my Pixel Fold (read: 2022 mid-range Android CPU, which should certainly be slower than a 2023 mid-to-upper-range Intel) kicks it around the block: 7 tokens/sec on CPU, 14 tokens/sec with the NEON layout.
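For a sense of what that NEON path means at the code level: NEON is ARM's fixed-width 128-bit SIMD, so a plain float dot product (the inner loop of every matmul) moves 4 floats per instruction. A toy sketch of my own, not llama.cpp's actual kernels (those work on quantized blocks):

```c
#include <arm_neon.h>

/* NEON: fixed 128-bit vectors, so exactly 4 floats per operation. */
float dot_neon(const float *a, const float *b, int n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        /* fused multiply-add: acc += a[i..i+3] * b[i..i+3] */
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);            /* horizontal reduce */
    for (; i < n; i++) sum += a[i] * b[i];  /* scalar tail */
    return sum;
}
```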
Now, that aside, SVE was shown to double that again, indicating there's significant headroom beyond NEON (https://github.com/ggerganov/llama.cpp/pull/9290). (I have ~0 idea what SVE is other than 'moar SIMD for ARM'; for all I know, it's Amazon Graviton-specific.)
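Digging a little: SVE is ARM's Scalable Vector Extension, so it isn't inherently Graviton-specific; Graviton just happens to ship it. Unlike NEON, the vector width isn't fixed at 128 bits: hardware can implement anywhere from 128 to 2048 bits, and predicate registers mask off the loop tail, so the same binary automatically uses wider units where they exist. The same toy dot product in SVE intrinsics (again a sketch, not the actual quantized kernel from that PR):

```c
#include <arm_sve.h>
#include <stdint.h>

/* SVE: vector width is a hardware property discovered at runtime;
 * the predicate pg masks out-of-range lanes, so no scalar tail. */
float dot_sve(const float *a, const float *b, int64_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);
    int64_t i = 0;
    svbool_t pg = svwhilelt_b32(i, n);       /* active where i + lane < n */
    while (svptest_any(svptrue_b32(), pg)) {
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);  /* acc += va * vb (active lanes) */
        i += svcntw();                       /* floats per vector: 4, 8, 16... */
        pg = svwhilelt_b32(i, n);
    }
    return svaddv_f32(svptrue_b32(), acc);
}
```

On a core with 256-bit SVE that loop chews 8 floats per iteration vs. NEON's 4, which would roughly line up with the doubling reported in that PR.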