Hacker News

vicchenai · today at 5:34 PM

the practical question is whether the read pattern is sequential enough to actually saturate NVMe bandwidth, or whether the attention-layer access pattern ends up random enough to kill throughput. sequential reads on a decent NVMe get you 5-7 GB/s; random reads drop to maybe 500 MB/s depending on queue depth.

for a 1T model you'd need to stream something like 2 TB of weights per forward pass at fp16. even at peak sequential that's 300+ seconds per token, which is... not great for interactive use, but maybe fine for batch inference where you don't care about latency.
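(the arithmetic above as a quick sketch; the bandwidth figures are the ones quoted in this comment, not measurements)

```python
# Back-of-envelope token latency when every weight is streamed from NVMe
# once per forward pass. Bandwidth numbers are the rough figures quoted
# above; real drives vary with queue depth and access pattern.

def seconds_per_token(params: float, bytes_per_param: float, gb_per_s: float) -> float:
    """Time to stream the full weight set once at a sustained bandwidth."""
    return params * bytes_per_param / (gb_per_s * 1e9)

# 1T params at fp16 (2 bytes each) = 2 TB per forward pass.
seq = seconds_per_token(1e12, 2, 7.0)     # best-case sequential reads
rnd = seconds_per_token(1e12, 2, 0.5)     # random-ish reads

print(f"sequential: ~{seq:.0f} s/token")  # ~286 s/token
print(f"random:     ~{rnd:.0f} s/token")  # ~4000 s/token
```

so even the optimistic sequential case lands around five minutes per token, which is where the batch-only framing comes from.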

still a cool proof of concept though. the gap between 'can run' and 'runs usefully' is where things get interesting.


Replies

p_ing · today at 6:27 PM

4K random read with a queue depth of 1 on an M1 Max is about 65 MB/s.

tatef · today at 6:25 PM

Yes, definitely agree. It's more of a POC than a functional use case. However, for many smaller MoE models this method can actually be useful and capable of achieving multiple tokens/sec.

zozbot234 · today at 5:46 PM

> for a 1T model youd need to stream something like 2TB of weights per forward pass

Isn't this missing the point of MoE models completely? MoE inference is sparse: you only read a small fraction of the weights per layer. You still have the problem of each individual expert layer being quite small (a few MiB each, give or take), but those reads are still large enough to get decent NVMe throughput.
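To make the sparsity point concrete, here is a sketch with a hypothetical MoE config (layer count, experts-active, and per-expert size below are illustrative assumptions, not any particular model's numbers):

```python
# Bytes streamed per token for sparse MoE inference: only the routed
# experts plus the shared (attention/router) weights are read per layer.
# All config numbers here are hypothetical, for scale only.

def moe_bytes_per_token(layers: int, experts_active: int,
                        expert_bytes: int, shared_bytes: int) -> int:
    """Per-token read volume = per-layer shared weights + active experts."""
    return layers * (shared_bytes + experts_active * expert_bytes)

# e.g. 60 layers, 8 experts routed per layer, 10 MiB per expert,
# 50 MiB of shared weights per layer:
total = moe_bytes_per_token(60, 8, 10 * 2**20, 50 * 2**20)
print(f"~{total / 2**30:.1f} GiB per token")  # ~7.6 GiB
```

A few GiB per token at NVMe sequential rates is seconds, not minutes, which is why the dense 2 TB figure overstates the cost for MoE. The catch is exactly the one noted above: whether each ~10 MiB expert read behaves sequentially or degrades toward the random-read floor.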
