Hacker News

adrian_b · today at 12:03 AM

Until now I have not run models that do not fit in 128 GB.

I have an Epyc server with 128 GB of high-throughput DRAM, which also has 2 AMD GPUs with 16 GB of DRAM each.

Until now I have experimented only with models that can fit in this memory, e.g. various medium-size Qwen and Gemma models, or gpt-oss.

But I am curious about how bigger models behave, e.g. GLM-5.1, Qwen3.5-397B-A17B, Kimi-K2.6, DeepSeek-V3.2, MiniMax-M2.7. I am also curious about how the non-quantized versions of the models with around 120B parameters behave, e.g. such versions of Nemotron and Qwen. It is said that quantization to 8 bits or even to 4 bits has negligible effects, but I want to confirm this with my own tests.

There is no way to test big models or non-quantized medium models at a reasonable cost other than by streaming the weights from SSDs. For some tasks it may be preferable to use a big model at a slow speed, if that means fewer attempts are needed to obtain something useful. For a coding assistant, it may be possible to batch many tasks so that they all progress simultaneously during a single pass over the SSD data.
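Back-of-envelope arithmetic makes the batching argument concrete. The sketch below is my own illustration, not from the thread: it assumes a dense model whose full weight set must be streamed once per generation step, and a batch of independent sequences that each advance one token per pass, so the cost of a pass is amortized across the batch.

```python
# Estimate token throughput when weights are streamed from SSD.
# Assumption: one full read of the weights per generation step.

def tokens_per_second(param_count: float, bytes_per_param: float,
                      ssd_gbps: float, batch_size: int = 1) -> float:
    """Aggregate tokens/s across the batch, for one weight pass per step."""
    weight_bytes = param_count * bytes_per_param
    seconds_per_pass = weight_bytes / (ssd_gbps * 1e9)
    return batch_size / seconds_per_pass

# A 120B-parameter model at 8 bits per weight, read from a 7 GB/s NVMe drive:
single = tokens_per_second(120e9, 1.0, 7.0)                   # ~0.06 tok/s
batched = tokens_per_second(120e9, 1.0, 7.0, batch_size=32)   # ~1.9 tok/s
```

With a single sequence the drive bandwidth caps generation at a few tokens per minute; batching 32 independent tasks over the same pass brings the aggregate rate to roughly two tokens per second under these assumed numbers.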

For now I am studying llama.cpp in order to determine how it can be modified to achieve the maximum performance that could be reached with SSDs.
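One thing an SSD-streaming path has to get right is overlap: the drive should be prefetching layer k+1's weights while the compute runs on layer k, so neither the SSD nor the CPU/GPU sits idle. This is a minimal sketch of that pipeline in Python, my own illustration rather than llama.cpp code; `read_layer` and `compute_layer` are hypothetical callbacks standing in for the real I/O and kernel work.

```python
# Sketch of a reader/compute pipeline with a bounded prefetch depth.
import threading
import queue

def run_layers(n_layers, read_layer, compute_layer, depth=2):
    """Stream layers: a reader thread stays `depth` layers ahead of compute."""
    buf = queue.Queue(maxsize=depth)

    def reader():
        for k in range(n_layers):
            buf.put((k, read_layer(k)))  # blocks once `depth` layers are queued
        buf.put(None)                    # sentinel: no more layers

    threading.Thread(target=reader, daemon=True).start()

    outputs = []
    while (item := buf.get()) is not None:
        k, weights = item
        outputs.append(compute_layer(k, weights))
    return outputs
```

The bounded queue is the key design choice: it keeps a fixed amount of weight data in RAM regardless of model size, while still hiding read latency behind compute whenever the per-layer compute time is comparable to the per-layer read time.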


Replies

zozbot234 · today at 6:14 AM

AIUI, the main obstacle to maximizing performance with SSD offload is that existing GGUF files for MoE models are not necessarily laid out so that fetching a single MoE layer-expert can be done by reading a single sequential extent off the file. It may be that the GGUF format is already flexible enough in its layout configuration that this is doable with a simple conversion; but if not, the GGUF specification would have to be extended to allow such a layout to be configured.
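The layout question reduces to a simple check on the file's tensor table: do all the tensors belonging to one layer-expert occupy a single gap-free byte range, so that one sequential read fetches the whole expert? A small sketch of that check, with hypothetical tensor names and `(name, offset, size)` tuples standing in for entries parsed from a GGUF header:

```python
# Check whether a group of tensors forms one sequential extent in the file,
# i.e. whether a single contiguous read can fetch all of them.

def contiguous_extent(tensors):
    """tensors: iterable of (name, byte_offset, byte_size) tuples."""
    spans = sorted((off, off + size) for _, off, size in tensors)
    return all(prev_end == start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))

# Hypothetical example: two tensors of one expert, laid out back to back.
packed = [("blk.0.ffn_up.e0", 0, 4096), ("blk.0.ffn_down.e0", 4096, 4096)]
# The same tensors with another expert's data interleaved between them.
scattered = [("blk.0.ffn_up.e0", 0, 4096), ("blk.0.ffn_down.e0", 8192, 4096)]
```

A conversion tool could run this check per (layer, expert) group and, when it fails, rewrite the file with the tensors reordered, assuming the format's alignment rules permit that ordering.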
