We need custom inference chips at scale for this, imho. Every computer (whatever the form factor or board) should have an inference unit so inference is efficient and fast and can be offloaded while the CPU is doing something else.
Look at the specs of this Orange Pi 6+ board - a dedicated 30 TOPS NPU.
Almost all of them have one already. Microsoft's "Copilot+" branding requires an NPU with a minimum TOPS rating (40+).
It's just that practically nothing uses those NPUs.
At this point in the timeline, compute is cheap; it's RAM that's basically unavailable.
I can't believe this was downvoted. It makes a lot of sense that mass-produced custom inference chips would be highly useful.
The bottleneck in common PC hardware is mostly memory bandwidth. Offloading the computation to a separate chip doesn't help when memory access is the limiting factor.
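A rough back-of-envelope sketch of why (illustrative numbers, not benchmarks, assuming decode is memory-bound and every generated token streams all active weights through memory once):

    # For memory-bound LLM decoding, each generated token reads all
    # model weights once, so tokens/sec is capped by
    # bandwidth / model size, no matter how many TOPS the NPU has.

    def max_decode_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
        """Upper bound on decode speed when weight reads dominate."""
        return bandwidth_gb_s / model_gb

    # Assumed figures: dual-channel DDR5 desktop ~80 GB/s;
    # an 8B-parameter model at 4-bit ~4.5 GB of weights.
    print(max_decode_tokens_per_sec(80, 4.5))    # ~17.8 tok/s ceiling
    # Same model with ~1000 GB/s of GPU VRAM bandwidth:
    print(max_decode_tokens_per_sec(1000, 4.5))  # ~222 tok/s ceiling

Under those assumptions, a faster compute chip on the same DDR5 bus barely moves the ceiling; more bandwidth does.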
There have been boards and chips with dedicated compute hardware for years, but they're only so useful for LLMs, which need huge memory bandwidth.