Loading experts to system memory is supported by most local-AI frameworks. But you do not gain much...

zozbot234 • today at 1:16 PM • 0 replies

Loading experts into system memory is supported by most local-AI frameworks. But you do not gain much by running that part of the decode on the GPU, since decode is limited by memory bandwidth rather than compute, and moving the expert weights across the CPU-GPU link adds its own overhead. It's best to use the GPU for speeding up the shared part of the model.
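To put rough numbers on that argument, here is a back-of-envelope sketch. Every figure in it is an illustrative assumption, not a measurement: the bytes of expert weights touched per token, the system-RAM bandwidth, and the host-to-GPU link bandwidth are all made up for the example, and `decode_ms_per_token` is a hypothetical helper. It only computes a lower bound from "bytes moved divided by bandwidth" and ignores compute, latency, and caching.

```python
def decode_ms_per_token(bytes_moved: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token time (ms) if the step is limited purely
    by moving `bytes_moved` bytes at `bandwidth_gb_s` GB/s."""
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3


# Illustrative assumptions only; substitute your own hardware and model.
EXPERT_BYTES_PER_TOKEN = 2e9  # routed-expert weights read for one token
RAM_GB_S = 80.0               # CPU reading experts straight from system RAM
LINK_GB_S = 25.0              # copying the same experts over the CPU-GPU link

# Experts stay in system RAM and the CPU runs them in place.
cpu_ms = decode_ms_per_token(EXPERT_BYTES_PER_TOKEN, RAM_GB_S)

# Experts stay in system RAM but are shipped to the GPU for each token.
gpu_ms = decode_ms_per_token(EXPERT_BYTES_PER_TOKEN, LINK_GB_S)

print(f"experts on CPU:            {cpu_ms:.1f} ms/token")
print(f"experts streamed to GPU:   {gpu_ms:.1f} ms/token")
```

Under these assumed numbers the transfer alone costs more than letting the CPU read the experts in place, no matter how fast the GPU computes afterwards. The picture changes if the link is faster than system RAM or if the experts fit in VRAM to begin with.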
