Hacker News

simonw, today at 3:10 PM

Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/


Replies

anemll, today at 6:48 PM

Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes it usable! I think one paper that's flying under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which hopefully can help with prefill for Flash streaming.

trebligdivad, today at 10:03 PM

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?

superjan, today at 5:23 PM

That was a very good summary. One detail the post could mention is that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
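
To put rough numbers on that saving, here is a minimal sketch of the arithmetic. It assumes all 512 experts in a layer are the same size and only counts routed expert weights; attention layers, any shared experts, and caching effects are ignored, and the function name is just illustrative.

```python
def active_expert_fraction(active: int, total: int) -> float:
    """Fraction of a layer's routed experts touched for one token.

    Assumes equally sized experts (an assumption, not stated in the thread).
    """
    return active / total


TOTAL_EXPERTS = 512  # experts per layer, as mentioned in the comment above

for active in (4, 10):
    frac = active_expert_fraction(active, TOTAL_EXPERTS)
    print(f"{active}/{TOTAL_EXPERTS} experts = {frac:.2%} of routed expert weights per layer")
```

Under those assumptions, 4 experts is about 0.78% of a layer's expert weights and 10 experts is about 1.95%, which is why streaming only the selected experts from flash can be so much cheaper than loading the whole layer.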