zozbot234 • today at 1:24 PM

Even small NPUs can offload some of the prefill compute, which gets quite expensive with longer contexts. It's less clear whether they can help directly during decode: that depends on whether they can access memory with good throughput and do dequant+compute internally, like GPUs can. The Apple Neural Engine only does INT8 or FP16 MADD ops, so it mostly doesn't help there.
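
To put rough numbers on the prefill vs. decode difference, here is a back-of-the-envelope sketch. It is only an illustration under simplifying assumptions that are mine, not measurements: each weight is read from memory once per pass, each weight does one multiply-add (2 FLOPs) per token, and activation and KV-cache traffic are ignored. The function name and the 2048-token, 4-bit figures are made up for the example.

```python
def arithmetic_intensity(tokens: int, bits_per_weight: float) -> float:
    """FLOPs performed per byte of weight traffic for one weight matrix.

    Assumptions (illustrative only): every weight is read once per pass,
    each weight contributes one multiply-add (2 FLOPs) per token, and
    activation / KV-cache memory traffic is ignored.
    """
    bytes_per_weight = bits_per_weight / 8
    return 2 * tokens / bytes_per_weight


# Prefill processes the whole prompt in one pass, so each weight that is
# loaded gets reused across every prompt token.
prefill = arithmetic_intensity(tokens=2048, bits_per_weight=4)

# Decode produces one token per pass, so each weight is loaded for a
# single multiply-add.
decode = arithmetic_intensity(tokens=1, bits_per_weight=4)

print(f"prefill: {prefill:.0f} FLOPs per weight byte")
print(f"decode:  {decode:.0f} FLOPs per weight byte")
```

Under those assumptions prefill does thousands of FLOPs per byte of weights fetched while decode does a handful, which is why extra compute helps prefill, while decode depends mostly on how fast the unit can stream weights and dequantize them itself.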
