rpdillon • today at 1:13 PM • 1 reply

Been running lemonade for some time on my Strix Halo box. It dispatches to the other backends it bundles, like diffusion and llama.cpp. I actually don't like their combined server, so what I use instead is their llama.cpp build for ROCm.

https://github.com/lemonade-sdk/llamacpp-rocm

But I'm not doing anything with images or audio. I get about 50 tokens a second with gpt-oss-120b. As others have pointed out, the NPU is meant for low-power, small models that are "always on", so it's not a huge win for the standard chatbot use case.

Replies

zozbot234 • today at 1:24 PM

Even small NPUs can offload some compute from prefill, which can be quite expensive with longer contexts. It's less clear whether they can help directly during decode; that depends on whether they can access memory with good throughput and do dequant+compute internally, like GPUs can. Apple Neural Engine only does INT8 or FP16 MADD ops, so that mostly doesn't help.
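
To put rough numbers on the prefill vs. decode split, here is a back-of-envelope sketch. It assumes decode is bound by reading each active weight once per token, and prefill is bound by about 2 FLOPs per active weight per prompt token; attention cost and KV-cache traffic are ignored. The function names and every input value in the example are hypothetical placeholders, not measurements of any particular chip or model.

```python
def decode_tokens_per_second_upper_bound(active_params, bits_per_weight, mem_bandwidth_gb_s):
    """Bandwidth-bound ceiling on decode speed.

    Assumes each active weight is read from memory exactly once per
    generated token and nothing else touches memory (an idealization).
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token


def prefill_seconds_lower_bound(active_params, prompt_tokens, compute_tflops):
    """Compute-bound floor on prefill time.

    Assumes ~2 FLOPs (one multiply, one add) per active weight per
    prompt token and ignores attention FLOPs (an idealization).
    """
    total_flops = 2 * active_params * prompt_tokens
    return total_flops / (compute_tflops * 1e12)


if __name__ == "__main__":
    # Hypothetical inputs: 5e9 active params, 4-bit weights,
    # 256 GB/s memory bandwidth, 40 TFLOPS of usable compute.
    tps = decode_tokens_per_second_upper_bound(5e9, 4, 256)
    secs = prefill_seconds_lower_bound(5e9, 8000, 40)
    print(f"decode ceiling: {tps:.1f} tok/s")
    print(f"prefill floor for 8000 tokens: {secs:.1f} s")
```

Under this model, extra compute from an NPU only lowers the prefill floor. It moves the decode ceiling only if the NPU can stream (and dequantize) weights at a bandwidth comparable to the GPU's, which is the open question above.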
