simonw • today at 3:10 PM • 3 replies

Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/

Replies

Thanks for posting this; that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes it usable! One paper that I think is flying under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which will hopefully help with prefill for Flash streaming.

trebligdivad • today at 10:03 PM

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?

superjan • today at 5:23 PM

That was a very good summary. One detail the post could mention is that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
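To put rough numbers on that saving, here is a back-of-the-envelope sketch in Python. It assumes all 512 experts in a layer are the same size and counts only expert weights (not attention or shared layers); the function name `active_expert_fraction` is made up for illustration.

```python
def active_expert_fraction(active: int, total: int = 512) -> float:
    """Fraction of one layer's experts that a single token routes to.

    Assumes equally sized experts, so this is also the fraction of
    per-layer expert weights that must be read for that token.
    """
    if not 0 < active <= total:
        raise ValueError("active must be between 1 and total")
    return active / total


if __name__ == "__main__":
    for k in (4, 10):
        frac = active_expert_fraction(k)
        print(f"{k} of 512 experts: {frac:.2%} of per-layer expert weights")
```

Under those assumptions, 4 of 512 is under 1% of a layer's expert weights per token, and 10 of 512 is about 2%.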
