For the occasional local LLM query, running locally probably won't make much of a dent in battery life; smaller models like Mistral 7B can run at 258 tokens/s on an iPhone 17[0].
The reasons local LLMs are unlikely to displace cloud LLMs are memory footprint and search. The most capable models require hundreds of GB of memory, which is impractical for consumer devices.
I run Qwen 3 2507 locally using llama-cpp. It's not a bad model, but I still use cloud models more, mainly because of their search-backed RAG. There are local tools for this, but they don't work as well. They might continue to improve, but I don't think they'll ever match the Google/Bing API integrations that cloud models use.
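For reference, a minimal sketch of what my local setup looks like, assuming the llama-cpp-python bindings and a GGUF quant already on disk (the file name here is hypothetical, not the exact quant I use):

    # Minimal sketch: load a local GGUF model and run one chat turn.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3-2507-q4_k_m.gguf",  # hypothetical local file name
        n_ctx=8192,        # context window
        n_gpu_layers=-1,   # offload all layers to GPU/Metal if available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Summarize the tradeoffs of local vs cloud LLMs."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])

The gap is everything around that call: the cloud models wrap the same kind of generation step in search, retrieval, and ranking that a local setup has to bolt on separately.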
I used Mistral 7B a lot in 2023. It was a good model then, but it's nowhere near where SOTA models are now.