There is no use case for local models.
See Gemini Nano. It is available to custom apps, but the results are so bad, riddled with factual errors and hallucinations, that it is effectively useless. I can see why Google did not roll it out to users.
Even if it were significantly better, local inference would still be slow. Adding a few milliseconds of network latency to contact a server and get a vastly superior result is preferable in nearly every scenario.
Arguments can be made for privacy or offline use, but those probably do not matter to most people.
I think the real use case is a future technology: something similar to speculative decoding, but split between the device and the server. The local model produces most tokens itself and reaches into the cloud only for the hard ones.
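Roughly what I mean, as a sketch. The model calls are placeholders for whatever on-device and hosted models you'd actually use, and the confidence-threshold rule is my assumption about how "hard tokens" would be detected, not a real protocol:

```python
# Hypothetical device/server split: the local draft model proposes each token,
# and only low-confidence steps get escalated to the cloud model.

def local_next_token(context: str) -> tuple[str, float]:
    """Placeholder for a small on-device model: returns (token, probability)."""
    return "the", 0.42

def cloud_next_token(context: str) -> str:
    """Placeholder for the server model: slower round trip, better token."""
    return "answer"

CONFIDENCE_THRESHOLD = 0.6  # assumed cutoff; tuning it trades latency for quality

def generate(prompt: str, max_tokens: int = 32) -> str:
    context = prompt
    for _ in range(max_tokens):
        token, prob = local_next_token(context)
        if prob < CONFIDENCE_THRESHOLD:
            # "Hard" token: defer this single step to the cloud model.
            token = cloud_next_token(context)
        context += " " + token
    return context
```

Most tokens stay on-device and free, and you only pay the network round trip for the handful of steps the local model is unsure about.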
I just want it to be able to control my Apple Home devices, trigger Shortcuts, and maybe search within a few apps to find things. I know a local model can understand my intent for Siri-like operations, because I literally have my own version of that running on my laptop.
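My version is nothing fancy, basically this shape. The run_local_model function stands in for whatever local runtime you use (llama.cpp, MLX, etc.), and the JSON schema and dispatch step are my own assumptions, not any Apple API:

```python
import json

def run_local_model(prompt: str) -> str:
    """Placeholder for a local LLM call; returns the model's raw text output."""
    return '{"action": "set_power", "device": "living room lights", "value": "off"}'

INTENT_PROMPT = (
    'Extract the home-automation intent as JSON with keys "action", "device", "value".\n'
    "User: {utterance}\nJSON:"
)

def handle_utterance(utterance: str) -> None:
    raw = run_local_model(INTENT_PROMPT.format(utterance=utterance))
    intent = json.loads(raw)
    # In practice this would trigger a HomeKit scene or run a Shortcut.
    print(f"{intent['action']}: {intent['device']} -> {intent['value']}")

handle_utterance("turn off the living room lights")
```

Intent extraction into a fixed schema is well within what a small local model can do reliably; it does not need world knowledge, just my phrasing mapped onto a short list of actions.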