logoalt Hacker News

zozbot234today at 1:16 PM0 repliesview on HN

Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on GPU, since decode is not compute-limited and the CPU-GPU transfer involves overhead. It's best to use the GPU for speeding up the shared part of the model.