drnick1 today at 5:17 AM

Consider this. One of the smallest Qwen models (4B parameters) powers my home automation voice assistant, and runs on CPU alone at >20 tok/s. It is enough for that use case, and could be made even better/faster with a modest GPU. It isn't as smart as some cloud-connected thingamajig, but I would never allow a literal Google or Amazon bug in my home. Huge SOTA models aren't relevant everywhere. Most people use LLMs for rather trivial tasks such as finding typos or drafting text.


Replies

marci today at 6:54 AM

But with Apple's AFM 3 architecture, we might end up with huge, SOTA-adjacent models on devices with limited RAM.

They use a technique where you only load between 1B and 4B parameters of a 20B dense model for an entire prompt run, rather than switching token by token as a MoE does, and run mostly on the low-power ANE instead of the GPU cores.

Now, imagine if/when they scale up to 100B or more? On a chip using 2W?

dainiusse today at 6:25 AM

Curious, what exactly does it do for you? I've had bad luck getting these small models to do anything useful, tbh.