Doesn’t need to be a winner head to head. If it can do 90% of the tasks the big boys do, at 50% speed, for virtually no extra overhead cost save for the power consumed by a prompt - that’s gonna work for a lot of people. And that’s also basically where we’re at today. Qwen3.6 35b running quantized on 10-year-old hardware solves basically all of my use cases for agents except for coding.
The frontier models are faster, and better at coding, but not so much that I’ll pay $200/month for them.
> If it can do 90% of the tasks the big boys do, at 50% speed
I want to live in this world too, but these numbers, as of today, are very aspirational and far removed from reality.
I'm no tokenmaxxer. I find my modest local setup useful, but I also know its limitations: it's slow, and it sucks (relatively) at high-level and/or long-context planning compared to frontier models. Only a minority of my prompts are max-effort; it's not all I do, but it also means frontier labs aren't dying any time soon.
This is what makes sense for me as well. All I need a local model for is playing with simple graphics: no gradients, at most ten colours, which I can push through VTracer to get an SVG. Draw Things does the job, usually in 120 seconds or less.
Sometimes, I need a quick throwaway bit of Python. That can take 30 minutes of my time.
Consider this. One of the smallest Qwen models (4B parameters) powers my home automation voice assistant, and runs on CPU alone at >20 tok/s. It is enough for that use case, and could be made even better/faster with a modest GPU. It isn't as smart as some cloud-connected thingamajig, but I would never allow a literal Google or Amazon bug in my home. Huge SOTA models aren't relevant everywhere. Most people use LLMs for rather trivial tasks such as finding typos or drafting text.