
dev_l1x_be · today at 1:41 PM

How do you split the model between multiple GPUs?


Replies

evilduck · today at 2:10 PM

With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals but this seems like it would run ok on a workstation with a half terabyte of RAM and a single RTX6000.

But to answer your question directly: tensor parallelism. See https://github.com/ggml-org/llama.cpp/discussions/8735 and https://docs.vllm.ai/en/latest/configuration/conserving_memo...
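
For a rough sketch of what that looks like with vLLM (the model id and GPU count below are placeholders, not the exact setup being discussed):

    # Sketch: split one model across 2 GPUs using vLLM's tensor parallelism.
    # The model id is a placeholder; substitute whatever checkpoint you're actually running.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="some-org/some-moe-model",  # placeholder model id
        tensor_parallel_size=2,           # shard each layer's weights across 2 GPUs
    )

    outputs = llm.generate(
        ["How do you split a model between multiple GPUs?"],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)

llama.cpp exposes the analogous knobs through its --tensor-split and --split-mode flags (covered in the first link).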