zozbot234, today at 1:24 PM

Even small NPUs can offload some compute from prefill, which can be quite expensive with longer contexts. It's less clear whether they can help directly during decode; that depends on whether they can access memory with good throughput and do dequant+compute internally, like GPUs can. The Apple Neural Engine only does INT8 or FP16 MADD ops, so it mostly doesn't help there.
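
To make the prefill vs. decode distinction concrete, here is a rough back-of-the-envelope sketch of arithmetic intensity for one pass over a weight matrix. It is a simplification, not a measurement: it ignores activation and KV-cache traffic, and the function names and the hardware numbers (10 TFLOP/s peak, 100 GB/s memory bandwidth, 4-bit weights) are hypothetical values chosen only for illustration.

```python
def arithmetic_intensity(tokens: int, bytes_per_weight: float) -> float:
    """FLOPs per byte of weight traffic for one pass over a weight matrix.

    Each weight is read once and used in one multiply-add (2 FLOPs) per
    token, so the matrix dimensions cancel out. Activation and KV-cache
    traffic are ignored (simplifying assumption).
    """
    return 2.0 * tokens / bytes_per_weight


def limiting_resource(tokens: int, bytes_per_weight: float,
                      peak_flops: float, mem_bandwidth: float) -> str:
    """Simple roofline check: compare intensity against the ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte where both limits meet
    if arithmetic_intensity(tokens, bytes_per_weight) >= ridge:
        return "compute"
    return "memory"


# Hypothetical accelerator, for illustration only.
PEAK_FLOPS = 10e12      # 10 TFLOP/s
MEM_BANDWIDTH = 100e9   # 100 GB/s
BYTES_PER_WEIGHT = 0.5  # 4-bit quantized weights

# Prefill processes the whole prompt in one batch; decode handles one token.
prefill = limiting_resource(4096, BYTES_PER_WEIGHT, PEAK_FLOPS, MEM_BANDWIDTH)
decode = limiting_resource(1, BYTES_PER_WEIGHT, PEAK_FLOPS, MEM_BANDWIDTH)
print(f"prefill is {prefill}-bound, decode is {decode}-bound")
```

Under these assumptions prefill lands far on the compute side of the ridge, which is where extra MADD units help, while single-token decode is limited by how fast weights can be streamed from memory and dequantized.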