I'd love to see a thorough breakdown of what these local NPUs can really do. I've had friends ask me about this (as the resident computer expert) and I really have no idea. Everything I see advertised for (blurring, speech to text, etc...) are all things that I never felt like my non-NPU machine struggled with. Is there a single remotely killer application for local client NPUs?
The problem is essentially memory bandiwdth afiak. Simplifying a lot in my reply, but most NPUs (all?) do not have faster memory bandwidth than the GPU. They were originally designed when ML models were megabytes not gigabytes. They have a small amount of very fast SRAM (4MB I want to say?). LLM models _do not_ fit into 4MB of SRAM :).
And LLM inference is heavily memory bandwidth bound (reading input tokens isn't though - so it _could_ be useful for this in theory, but usually on device prompts are very short).
So if you are memory bandwidth bound anyway and the NPU doesn't provide any speedup on that front, it's going to be no faster. But has loads of other gotchas so no real "SDK" format for them.
Note the idea isn't bad per se, it has real efficiencies when you do start getting compute bound (eg doing multiple parallel batches of inference at once), this is basically what TPUs do (but with far higher memory bandwidth).
In theory NPUs are a cheap, efficient alternative to the GPU for getting good speeds out of larger neural nets. In practice they're rarely used because for simple tasks like blurring, speech to text, noise cancellation, etc you can get usually do it on the CPU just fine. For power users doing really hefty stuff they usually have a GPU anyway so that gets used because it's typically much faster. That's exactly what happens with my AMD AI Max 395+ board. I thought maybe the GPU and NPU could work in parallel but memory limitations mean that's often slower than just using the GPU alone. I think I read that their intended use case for the NPU is background tasks when the GPU is already loaded but that seems like a very niche use case.
> Everything I see advertised for (blurring, speech to text, etc...) are all things that I never felt like my non-NPU machine struggled with.
I don’t know how good these neural engines are, but transistors are dead-cheap nowadays. That makes adding specialized hardware a valuable option, even if it doesn’t speed up things but ‘only’ decreases latency or power usage.
I think a lot of it is just power savings on those features, since the dedicated silicon can be a lot more energy efficient even if it's not much more powerful.
"WHAT IS MY PURPOSE?"
"You multiply matrices of INT8s."
"OH... MY... GOD"
NPUs really just accelerate low-precision matmuls. A lot of them are based on systolic arrays, which are like a configurable pipeline through which data is "pumped" rather than a general purpose CPU or GPU with random memory access. So they're a bit like the "synergistic" processors in the Cell, in the respect that they accelerate some operations really quickly, provided you feed them the right way with the CPU and even then they don't have the oomph that a good GPU will get you.
I used to work at Intel until recently. Pat Gelsinger (the prior CEO) had made one of the top goals for 2024 the marketing of the "AI PC".
Every quarter he would have an all company meeting, and people would get to post questions on a site, and they would pick the top voted questions to answer.
I posted mine: "We're well into the year, and I still don't know what an AI PC is and why anyone would want it instead of a CPU+GPU combo. What is an AI PC and why should I want it?" I then pointed out that if a tech guy like me, along with all the other Intel employees I spoke to, cannot answer the basic questions, why would anyone out there want one?
It was one of the top voted questions and got asked. He answered factually, but it still wasn't clear why anyone would want one.