On-device moves all compute cost (including electricity) to the consumer. As of 2025, that means much less battery life, a much warmer device, and much higher electricity costs. Unless the M-series can do substantially more with less, this is a dead end.
For the occasional local LLM query, running locally probably won't make much of a dent in battery life; smaller models like mistral-7b can run at 258 tokens/s on an iPhone 17 [0].
The reasons local LLMs are unlikely to displace cloud LLMs are memory footprint and search. The most capable models require hundreds of GB of memory, which is impractical for consumer devices.
I run Qwen 3 2507 locally using llama-cpp. It's not a bad model, but I still use cloud models more, mainly because they have good search RAG. There are local tools for this, but they don't work as well. That might continue to improve, but I don't think it will ever match the API integrations with Google/Bing that cloud models use.
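If anyone wants to try this, a minimal sketch with the llama-cpp-python bindings looks like the following; the GGUF filename and settings here are placeholders, not my exact config:

    from llama_cpp import Llama

    # Load a quantized GGUF build of the model; the path is a placeholder.
    llm = Llama(
        model_path="qwen3-2507-q4_k_m.gguf",
        n_ctx=8192,        # context window, sized to available RAM
        n_gpu_layers=-1,   # offload all layers to Metal/GPU if available
    )

    # Answer a question entirely on-device; no network round trip involved.
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize the trade-offs of on-device LLMs."}]
    )
    print(out["choices"][0]["message"]["content"])

The missing piece compared to cloud models is the search step: you'd still need to bolt on your own retrieval before the prompt, and that's exactly the part the local tools don't do as well.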
Battery isn't relevant to plugged-in devices, and in the end, electricity costs roughly the same to generate and deliver to a data center as to a home. The real cost advantage the cloud has is better amortization of hardware, since you can run powerful hardware at 100% utilization 24/7, spread across many users. I wouldn't bet on that continuing indefinitely; consumer hardware tends to catch up to HPC-exclusive workloads eventually.
Apple runs all the heavy compute stuff overnight when your device is plugged in. The cost of the electricity is effectively nothing. And there is no impact on your battery life or device performance.
That's fair for brute force (running a model on the GPU), but that's exactly where NPUs come in: they are orders of magnitude more energy-efficient at matrix operations than GPUs. Apple has been putting NPUs in every chip for years for a reason. For short, bursty tasks (answering a question, generating an image), the battery impact will be minimal. It's not 24/7 crypto mining; it's an impulse load.