Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/
I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?
That was a very good summary. One detail the post could mention is that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
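To put a rough number on that, here is a back-of-the-envelope sketch. It assumes all experts in a layer are the same size and ignores the attention and any shared weights, which are needed regardless; the function name is just for illustration.

```python
def active_expert_fraction(active: int, total: int) -> float:
    """Fraction of a layer's experts that are touched for one token."""
    if not 0 < active <= total:
        raise ValueError("active must be between 1 and total")
    return active / total


TOTAL_EXPERTS = 512  # experts per layer, as mentioned above

for k in (4, 10):
    frac = active_expert_fraction(k, TOTAL_EXPERTS)
    # Assumption: equal-sized experts, so this is also the share of
    # expert weights that would need to be read for a single token.
    print(f"{k} of {TOTAL_EXPERTS} experts: {frac:.2%} of the expert weights per layer")
```

So under those assumptions only about 0.78% (4 experts) or 1.95% (10 experts) of each layer's expert weights are touched per token.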
Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which is what makes it usable! One paper that I think has flown under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391, which hopefully can help with prefill for Flash streaming.