"WHAT IS MY PURPOSE?"
"You multiply matrices of INT8s."
"OH... MY... GOD"
NPUs really just accelerate low-precision matmuls. A lot of them are based on systolic arrays, which work like a configurable pipeline that data is "pumped" through, rather than like a general-purpose CPU or GPU with random memory access. In that respect they're a bit like the "synergistic" processors in the Cell: they run certain operations very quickly, provided the CPU feeds them the right way, and even then they don't have the oomph that a good GPU will get you.
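To make the "pumped" part concrete, here's a rough sketch of how an output-stationary systolic array could schedule an INT8 matmul. It's a toy timing model in plain Python, not how any particular NPU works: the name `systolic_matmul_int8` and the skewing scheme (PE (i, j) sees its k-th operand pair at cycle i + j + k) are my own assumptions for illustration.

```python
# Toy timing model of an output-stationary systolic array (illustrative only).
# Assumption: PE (i, j) holds the accumulator for C[i][j]; rows of A enter from
# the left and columns of B from the top, each skewed by one cycle per row/column.

INT8_MIN, INT8_MAX = -128, 127


def systolic_matmul_int8(a, b):
    """Multiply an MxK by a KxN matrix of INT8 values, one 'cycle' at a time."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or any(len(row) != n for row in b):
        raise ValueError("shape mismatch")
    for row in a + b:
        for v in row:
            if not INT8_MIN <= v <= INT8_MAX:
                raise ValueError(f"{v} does not fit in INT8")

    # Accumulators are wider than the inputs (typically INT32 in hardware);
    # Python ints are unbounded, so overflow is not modelled here.
    acc = [[0] * n for _ in range(m)]

    # The last PE (m-1, n-1) consumes its last operand pair at cycle m+n+k-3.
    for cycle in range(m + n + k - 2):
        for i in range(m):
            for j in range(n):
                step = cycle - i - j  # which operand pair reaches PE (i, j) now
                if 0 <= step < k:
                    acc[i][j] += a[i][step] * b[step][j]
    return acc


if __name__ == "__main__":
    print(systolic_matmul_int8([[1, -2], [3, 4]], [[5, 6], [7, -8]]))
    # [[-9, 22], [43, -14]]
```

The point is that every PE does exactly one multiply-accumulate per cycle on whatever data arrives from its neighbours, with no addressing and no random memory access, which is why the feeding order matters so much.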
Do compilers know how to take advantage of that, or do programs need code that specifically takes advantage of that?
So it's a higher-power, DSP-style device. Small transformers for flows. Sounds good for audio and maybe tailored video flow processing.
My question is: isn't this exactly what SIMD has done before? SSE2 instructions, for example?
To me, an NPU and how it's described just looks like a pretty shitty and useless FPGA that any alternative FPGA from Xilinx could easily replace.