Hacker News

rahimnathwani · yesterday at 11:25 PM

The largest nodes in his cluster each have 512GB of RAM. DeepSeek V3.1 is a 671B-parameter model whose weights take up roughly 700GB of RAM: https://huggingface.co/deepseek-ai/DeepSeek-V3.1

I would have expected that going from one node (which can't hold the weights in RAM) to two nodes would have increased inference speed by more than the measured 32% (21.1t/s -> 27.8t/s).

With no constraint on RAM (4 nodes), the inference speed is less than 50% faster than with only a single 512GB node.

Am I missing something?
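
A back-of-envelope version of that arithmetic (all figures are the ones quoted above; the "expected" framing is just the naive reading):

    # Rough sanity check on the numbers above (no new measurements).
    weights_gb = 700          # approx. size of the DeepSeek V3.1 weights
    ram_per_node_gb = 512     # largest nodes in the cluster

    # Fraction of the weights one node can keep resident in RAM.
    print(f"1 node holds ~{ram_per_node_gb / weights_gb:.0%} of the weights")

    # Measured decode throughput (tokens/second) from the post.
    tps_1_node = 21.1
    tps_2_nodes = 27.8
    print(f"2 nodes: +{tps_2_nodes / tps_1_node - 1:.0%} over 1 node")  # ~+32%

    # If one node were badly bottlenecked on streaming the missing ~27% of
    # weights from SSD, you might naively expect a much bigger jump once
    # everything fits in RAM -- hence the question.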


Replies

elorant · yesterday at 11:54 PM

You only get 80Gbps of network bandwidth; there's your bottleneck right there. InfiniBand, by comparison, can give you up to 10x that.
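
For scale, a rough transfer-time comparison at those two link speeds (the per-token traffic figure below is a made-up illustration, not something measured in the post):

    def transfer_ms(size_mb: float, link_gbps: float) -> float:
        """Milliseconds to move size_mb megabytes over a link_gbps link."""
        return size_mb * 8e6 / (link_gbps * 1e9) * 1e3

    # Hypothetical inter-node activation traffic per generated token.
    activation_mb = 4

    for name, gbps in [("TB5 networking (~80 Gbps)", 80),
                       ("InfiniBand (~800 Gbps)", 800)]:
        print(f"{name}: {transfer_ms(activation_mb, gbps):.2f} ms/token")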

zeusk · yesterday at 11:31 PM

The TB5 link (even with RDMA) is much slower than direct access to system memory.
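
Putting rough numbers on that gap (the local-memory figure assumes an Ultra-class Apple Silicon machine, which the post doesn't confirm):

    tb5_gbps = 80                    # TB5 link speed quoted above
    tb5_gb_per_s = tb5_gbps / 8      # ~10 GB/s

    # Unified memory bandwidth on an Ultra-class Apple Silicon machine is on
    # the order of 800 GB/s (assumed hardware, not stated in the post).
    local_gb_per_s = 800

    print(f"TB5 link: ~{tb5_gb_per_s:.0f} GB/s; "
          f"local memory: ~{local_gb_per_s} GB/s "
          f"(~{local_gb_per_s / tb5_gb_per_s:.0f}x)")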

zozbot234 · yesterday at 11:56 PM

Weights are read-only data, so they can just be memory-mapped and reside on SSD (only a small fraction needs to be in memory at any given time); the real constraint is activations. The MoE architecture should help quite a bit here.
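
A minimal sketch of that approach, assuming a raw fp16 weights file (the filename and layout are hypothetical; real runtimes such as llama.cpp memory-map their own formats):

    import numpy as np

    # Map the file read-only: nothing is loaded up front, pages are faulted
    # in from SSD only when actually touched.
    weights = np.memmap("model-weights.bin", dtype=np.float16, mode="r")

    # Touching a slice pulls just those pages into the page cache. With an
    # MoE model, only the experts routed to for the current token need to be
    # resident, so the working set stays far below the full ~700GB.
    chunk = np.asarray(weights[:1_000_000])
    print(chunk.mean())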
