logoalt Hacker News

EnPissanttoday at 5:53 AM2 repliesview on HN

Streaming weights from RAM to GPU for prefill makes sense due to batching and pcie5 x16 is fast enough to make it worthwhile.

Streaming weights from RAM to GPU for decode makes no sense at all because batching requires multiple parallel streams.

Streaming weights from SSD _never_ makes sense because the delta between SSD and RAM is too large. There is no situation where you would not be able to fit a model in RAM and also have useful speeds from SSD.


Replies

simonwtoday at 6:23 AM

There have been some very interesting experiments with streaming from SSD recently: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/