A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions.
The software now has real software engineers working on it instead of just researchers.
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
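The mmap trick the argument was about (popularized by llama.cpp for loading weights) can be sketched with numpy's memmap. The `weights.bin` file and its shape are made up for illustration; the point is that mapping the file does no bulk read — the OS pages weights in lazily as they're touched and can evict them under memory pressure, since the mapping is backed by the file:

```python
import numpy as np

# Hypothetical weights file standing in for a real checkpoint.
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 256)).astype(np.float32)
weights.tofile("weights.bin")

# Mapping is cheap: no data is read here. Pages are faulted in
# only when a slice is actually accessed.
mapped = np.memmap("weights.bin", dtype=np.float32, mode="r",
                   shape=(1024, 256))

row = np.array(mapped[42])  # only the touched pages become resident
assert np.allclose(row, weights[42])
```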
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
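Tiling in the sense this comment describes can be sketched as: split the weight matrix into row tiles, pull one tile at a time off storage (here via memmap), and accumulate the matvec, so peak working memory is one tile rather than the whole matrix. The file name and tile size are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 128)).astype(np.float32)
x = rng.standard_normal(128).astype(np.float32)
W.tofile("w.bin")

mapped = np.memmap("w.bin", dtype=np.float32, mode="r", shape=(512, 128))

TILE = 64  # rows per tile; picked arbitrarily for the sketch
y = np.empty(512, dtype=np.float32)
for start in range(0, 512, TILE):
    tile = np.array(mapped[start:start + TILE])  # only this tile is in RAM
    y[start:start + TILE] = tile @ x

# Tiled result matches the full in-memory matvec.
assert np.allclose(y, W @ x, atol=1e-4)
```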
It wasn't considered impossible. There are examples of large MoE LLMs running on small hardware all over the internet, like giant models on Raspberry Pi 5.
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top-spec iPhone Pro the output is 100x slower than output from hosted services.
FTFY: A year ago this would have been considered impossible. The software is moving faster than anyone's hardware assumptions.
I mean, by any reasonable standard it still is. Almost any computer can run an LLM; it's just a matter of how fast, and 0.4k/s (peak, before the first token) is not really considered running. It's a demo, but practically speaking it's entirely useless.
Does the iPhone have some kind of hardware acceleration for neural networks/AI?
This isn't a hardware feat, this is a software triumph.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
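One of the standard software tricks for fitting a large model onto consumer hardware is weight quantization: store parameters in a few bits and dequantize on the fly. A minimal per-row symmetric int8 sketch, not the specific scheme any particular model actually ships:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((256, 64)).astype(np.float32)  # fp32: 4 bytes/param

# Per-row symmetric int8 quantization: 1 byte/param plus one fp32 scale per row.
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
q = np.round(W / scales).astype(np.int8)

# Dequantize on the fly when a row is actually needed.
W_hat = q.astype(np.float32) * scales

assert q.nbytes == W.nbytes // 4       # 4x smaller weight storage
assert np.abs(W - W_hat).max() < 0.05  # small reconstruction error
```

Real schemes go further (4-bit groups, outlier handling), but the memory math is the same: fewer bits per weight is what turns a server-sized model into a phone-sized one.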