
jychang · last Sunday at 10:45 AM · 0 replies

No, that's not true. Prompt processing just needs the attention tensors in VRAM; the MLP weights aren't needed for the heavy calculations that a GPU speeds up. (After attention, you only need to pass the activations from the GPU to system RAM, which is only ~40KB, so you're not very limited there.)

That's pretty small.

Even Deepseek R1 0528 685b only has ~16GB of attention weights. Kimi K2, with 1T parameters, has 6,168,951,472 attention params, which at 16-bit precision (2 bytes per param) works out to ~12GB.
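
Quick sanity check on that figure (the 2-bytes-per-param assumption, i.e. fp16/bf16 weights, is mine):

    awk 'BEGIN { print 6168951472 * 2 / 2^30 " GiB" }'   # ~11.5 GiB, i.e. the ~12GB above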

It's pretty easy to do prompt processing for massive models like Deepseek R1, Kimi K2, or Qwen 3 235b with only a single Nvidia 3090 GPU. Just pass --n-cpu-moe 99 to llama.cpp or something similar.
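
A minimal sketch of what that invocation looks like, assuming a recent llama.cpp build that has --n-cpu-moe (the model path, context size, and prompt file are placeholders):

    # sketch only: model file, -c, and -f values are placeholders
    # offload all layers to the GPU, then push every layer's MoE expert
    # weights back to system RAM, leaving attention + KV cache in VRAM
    ./llama-cli -m DeepSeek-R1-0528-Q4_K_M.gguf \
        --n-gpu-layers 99 \
        --n-cpu-moe 99 \
        -c 8192 \
        -f prompt.txt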