This isn't a good analysis, because it keeps rounding everything up. The author rounds up the cost of electricity by 10%, takes the high end of a power-use range (which is 2x the low end), and multiplies it by that inflated electricity cost.
But then he talks about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast, but as the author points out, you're only getting 10-40 tokens per second. That's not bad, but it's not meant for this!
It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't also trying to be a general-purpose Mac.
Apple silicon works out pretty well if you're not burning tokens 24/7/365 and you didn't buy the hardware specifically to do it. I use my Mac Studio a few times a week for things I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on multi-tenant hardware with cheap electricity will pretty much always win.
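To make the sunk-cost point concrete, here's a rough back-of-envelope sketch. All the numbers plugged in below are hypothetical placeholders (not the article's figures, not my actual meter readings); the point is just how the per-token price swings with utilization and with whether you count the hardware at all.

```python
def amortized_cost_per_mtok(hw_cost_usd, lifetime_years, hours_per_day,
                            power_watts, usd_per_kwh, tokens_per_sec):
    """Hardware amortization plus electricity, spread over actual usage hours.

    Every input here is a made-up illustrative number, not a measurement.
    """
    usage_hours = lifetime_years * 365 * hours_per_day
    tokens = tokens_per_sec * 3600 * usage_hours
    electricity_usd = (power_watts / 1000) * usage_hours * usd_per_kwh
    return (hw_cost_usd + electricity_usd) / tokens * 1_000_000

# Hardware you bought anyway is a sunk cost: pass 0 and only electricity remains.
print(amortized_cost_per_mtok(0, 3, 1, 60, 0.24, 10))     # electricity only
print(amortized_cost_per_mtok(4000, 3, 1, 60, 0.24, 10))  # $4k rig, 1 h/day
```

With these made-up inputs the marginal (electricity-only) cost comes out around $0.40/Mtok, but charging a $4k machine against one hour a day of use pushes the amortized figure past $100/Mtok. That's the whole "don't buy a Mac specifically for this" argument in one function.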
The article makes no sense. I can't use OpenRouter as a general-purpose computing device. Why are we comparing a whole computer to a single-purpose SaaS?
Rounded up, yes, and oddly inefficient for someone obsessed with inefficiency. One could buy a brand-new 64GB M5 MacBook for well over $4k. Another could buy a scratched-up but functioning 64GB M1 Max off eBay for a little over $1k, and somehow get the same 10-20 t/s on a 31B model that the author gets with an M5. Or better yet, have a frontier model do the planning and judging, and have a local MoE model execute at 50 t/s. All of this achievable by a former English major with too much free time.
Your post makes sense if you bought the hardware for other reasons, and maybe run models occasionally as a novelty.
That isn't the case for many, though, and there is a whole social media space where people are hyping up the latest homebrew options for running models, believing it frees them from the yoke of big AI.
Millions of people are buying expensive, maxed-out hardware like Mac Studios or a DGX specifically to run LLMs. Someone rationally running the numbers is a good thing.
Using it 24/7 brings the average cost per token down, not up.
The less you use a local LLM, the less sense it makes, since you paid a lot for hardware you don't use.
Nothing about the current data center craze looks efficient.
Rounding everything down in the most optimistic setting got me to $0.40 per million tokens, and OpenRouter has the same model at $0.38/Mtok.