It's a seriously degraded experience from a developer's perspective. Okay, you've finally got one local LLM installed after configuring everything perfectly; what happens when you want to run a second instance? Now you've blown past your VRAM and system RAM limits, and you're stuck with just one.
Furthermore, the model they recommend doesn't quite reach ~gpt-5.4-mini level performance. That quality dip means you may as well just pay for something like Kimi K2.6 via OpenRouter if you want something ~>= Sonnet 4.6 in performance as a backup for when you run out of Anthropic/OpenAI usage.
Not sure why you got downvoted. 95% of people should be paying for a subscription. It's far cheaper, far more scalable, and far less hassle.
Local AI only makes sense for a handful of use cases:
- Privacy
- Constant churning on tokens
- Latency
- Availability
Local AI is "cheaper" when you already have the hardware sitting around, like an old MacBook or gaming GPU, or when the API cost is too high to bear (subscriptions will all run out if you churn 24/7). I'm surprised companies are still selling their old MacBooks to employees when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs; the marginal cost is just electricity.

If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price-per-request-limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...
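For the "constant churn" case, a rough back-of-envelope comparison makes the point. All numbers here are illustrative assumptions (GPU wattage, electricity price, throughput, API pricing), not measured figures:

```python
# Back-of-envelope: local electricity cost vs. API cost for a 24/7 token churn.
# Every constant below is an assumption for illustration only.

GPU_WATTS = 300            # assumed sustained draw of an old gaming GPU
KWH_PRICE = 0.15           # assumed electricity price in $/kWh
LOCAL_TOK_PER_SEC = 30     # assumed local generation throughput
API_PRICE_PER_MTOK = 3.00  # assumed blended API price in $/1M tokens

HOURS = 24 * 30  # one month of non-stop churning

local_kwh = GPU_WATTS / 1000 * HOURS
local_cost = local_kwh * KWH_PRICE

tokens = LOCAL_TOK_PER_SEC * 3600 * HOURS
api_cost = tokens / 1_000_000 * API_PRICE_PER_MTOK

print(f"local: ${local_cost:,.2f} for ~{tokens / 1e6:,.0f}M tokens")
print(f"api:   ${api_cost:,.2f} for the same volume")
```

With these made-up numbers that's roughly $32 of electricity against roughly $233 of API spend for the month, which is why the math only tilts local when you genuinely run the hardware flat out.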
Why are you running two instances anyway? If you want that workflow, just rent a few EC2 GPU instances and fire away.
You've got a token addiction.
Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that).
However, there isn't much of a memory increase from running multiple sessions in parallel against one model. It's an HTTP server, and other than some caching, it's basically stateless.
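A minimal sketch of what that looks like in practice, assuming an OpenAI-compatible local server (e.g. llama.cpp's llama-server) listening on localhost:8080; the URL, port, and model name are assumptions you'd adjust to your setup:

```python
# Several parallel "sessions" are just concurrent HTTP requests to one server.
# The weights are loaded once; per-request state (mostly KV cache) is what grows.
import concurrent.futures
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint

def ask(prompt: str) -> str:
    resp = requests.post(URL, json={
        "model": "local",  # many local servers ignore or loosely match this
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [
    "Summarize RAII in one line.",
    "What does mmap do?",
    "Explain KV caching briefly.",
]

# Three concurrent sessions against the same single loaded model.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```

Since each request carries its own messages, the server needs no per-client memory between calls, which is what keeps the marginal cost of an extra session small.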