Hacker News

rohansood15, today at 5:05 AM (1 reply)

The title is Apple Silicon costs LESS than OpenRouter. Not sure why it got updated to this - maybe because I referenced the original HN post?

Here's the full post:

TL;DR: Once you account for batching, caching and input tokens, together with the residual value of the MacBook Pro, running locally is actually 14% cheaper than OpenRouter. This becomes a whopping 3x (i.e. 65%) cheaper if you consider MoE models like Gemma 4 26B.

There was a well-meaning post yesterday by @DataDrivenAngel comparing the cost of self-hosting LLMs vs. using OpenRouter (HN link). The analysis, however, had a few flaws, as pointed out by the HN community, and I ran benchmarks on my M4 Max 128GB to adjust for those.

1. The estimate was based entirely on output tokens, instead of a real-world input/output token mix. The numbers look very different if you consider a 4:1 or 5:1 input-to-output token ratio.

2. Batching, concurrency and caching improve token throughput, and if you're running multiple coding agents or worktrees the performance gain can be significant.

3. A MacBook Pro is an asset purchase, and it retains significant residual value through its life. It is probably not unreasonable to expect ~1.5-2.5k resale value after 3-5 years of use.
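To illustrate point 1, here is a minimal sketch of how a blended API price shifts with the input-to-output mix. The two per-token prices are hypothetical placeholders, not OpenRouter's actual rates, and `blended_price_per_million` is a name made up for this example.

```python
def blended_price_per_million(input_price, output_price, input_ratio, output_ratio=1.0):
    """Weighted-average $ per million tokens for a given input:output token mix."""
    total = input_ratio + output_ratio
    return (input_price * input_ratio + output_price * output_ratio) / total


# Hypothetical prices in $ per million tokens (placeholders, not real OpenRouter rates).
INPUT_PRICE = 0.20
OUTPUT_PRICE = 0.80

# A ratio of 0 means "output tokens only", which is what the original estimate assumed.
for ratio in (0, 4, 5):
    price = blended_price_per_million(INPUT_PRICE, OUTPUT_PRICE, ratio)
    print(f"{ratio}:1 input:output -> ${price:.3f} per million tokens")
```

Because input tokens are typically priced lower than output tokens, the blended price falls as the input share grows, which is why an output-only estimate overstates the API cost of an agent-style workload.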

I ran vllm bench using a reasonable approximation of a coding agent workload, with concurrency 4, for Gemma 4 31B (same as the original post), and got the following results:

-----------------------------------

Serving Benchmark: Gemma 4 31B
Successful requests: 20
Maximum request concurrency: 4
Benchmark duration (s): 263.19
Total input tokens: 35000
Total generated tokens: 6400
Request throughput (req/s): 0.08
Output token throughput (tok/s): 24.32
Peak output token throughput (tok/s): 36
Peak concurrent requests: 8
Total token throughput (tok/s): 157.3

Scenario (local cost per million tokens, compared with OpenRouter):
3 years: $0.15, local cheaper (~6%)
5 years: $0.14, local cheaper (~13%)
7 years: $0.13, local cheaper (~19%)

-----------------------------------
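As a sanity check on the benchmark figures above, the throughput lines can be re-derived from the token totals and the duration. This sketch assumes each throughput is simply a count divided by wall-clock duration; the `throughput` helper is made up for this example and is not part of vLLM.

```python
def throughput(total_input_tokens, total_output_tokens, num_requests, duration_s):
    """Derive per-second rates from benchmark totals (assumes count / wall-clock time)."""
    return {
        "request_throughput": num_requests / duration_s,
        "output_token_throughput": total_output_tokens / duration_s,
        "total_token_throughput": (total_input_tokens + total_output_tokens) / duration_s,
    }


# Totals taken from the Gemma 4 31B run reported above.
dense = throughput(
    total_input_tokens=35000,
    total_output_tokens=6400,
    num_requests=20,
    duration_s=263.19,
)

for name, value in dense.items():
    print(f"{name}: {value:.2f}")
```

The total token throughput (input plus output) is the figure that matters for a blended cost per million tokens, since both kinds of tokens are billed on the API side.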

Once you work out the math (using the original assumptions on power costs and a 5-year timeline), you get a blended cost of ~$0.14 per million tokens for local, vs. ~$0.16 for OpenRouter. That is not a massive win. But it is close enough to flip the narrative from local being more expensive to 'it depends'.
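The shape of that calculation can be sketched as follows. Every input except the measured throughput is a hypothetical placeholder (purchase price, resale value, power draw, electricity rate), not the original post's assumptions, so the printed numbers are illustrative only and will not necessarily match the figures above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
HOURS_PER_YEAR = 365 * 24


def local_cost_per_million(purchase_price, residual_value, avg_power_watts,
                           electricity_per_kwh, tokens_per_second, years,
                           utilization=1.0):
    """Amortized $ per million tokens: (depreciation + electricity) / tokens served."""
    hardware_cost = purchase_price - residual_value
    energy_kwh = avg_power_watts / 1000 * HOURS_PER_YEAR * years * utilization
    power_cost = energy_kwh * electricity_per_kwh
    million_tokens = tokens_per_second * SECONDS_PER_YEAR * years * utilization / 1e6
    return (hardware_cost + power_cost) / million_tokens


# Placeholder assumptions for illustration; only tokens_per_second comes from the benchmark.
for years in (3, 5, 7):
    cost = local_cost_per_million(
        purchase_price=5000,       # hypothetical
        residual_value=2000,       # hypothetical, held constant across scenarios
        avg_power_watts=60,        # hypothetical
        electricity_per_kwh=0.15,  # hypothetical
        tokens_per_second=157.3,   # total token throughput from the 31B run
        years=years,
    )
    print(f"{years} years: ${cost:.3f} per million tokens")
```

The `utilization` parameter is there because the result is very sensitive to it: at 100% the machine serves tokens around the clock, and halving utilization roughly doubles the hardware share of the cost per token.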

But it doesn't end there. If you use an MoE model like Gemma 4 26B, the blended cost drops to $0.038 per million tokens, vs. OpenRouter's $0.10 per million. That is a ~3x difference.

-----------------------------------

Serving Benchmark: Gemma 4 26B (MoE)
Successful requests: 20
Maximum request concurrency: 4
Benchmark duration (s): 60.05
Total input tokens: 30002
Total generated tokens: 4870
Request throughput (req/s): 0.33
Output token throughput (tok/s): 81.1
Peak output token throughput (tok/s): 128
Peak concurrent requests: 8
Total token throughput (tok/s): 580.72

Scenario (local cost per million tokens, compared with OpenRouter):
3 years: $0.040, local cheaper (~60%)
5 years: $0.038, local cheaper (~62%)
7 years: $0.035, local cheaper (~65%)

-----------------------------------
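Since "N% cheaper" and "Nx cheaper" are easy to mix up, here is a small sketch that converts the 5-year per-million-token costs quoted above into both forms. The `compare` helper is made up for this example.

```python
def compare(local_cost, api_cost):
    """Return (percent saved by going local, API cost as a multiple of local cost)."""
    savings_pct = (1 - local_cost / api_cost) * 100
    ratio = api_cost / local_cost
    return savings_pct, ratio


# 5-year blended costs in $ per million tokens, as quoted in the post.
cases = [
    ("Gemma 4 31B", 0.14, 0.16),
    ("Gemma 4 26B (MoE)", 0.038, 0.10),
]

for label, local_cost, api_cost in cases:
    savings_pct, ratio = compare(local_cost, api_cost)
    print(f"{label}: local is {savings_pct:.1f}% cheaper, API costs {ratio:.2f}x local")
```

With these inputs the dense model works out to roughly 12.5% cheaper (about 1.14x) and the MoE model to roughly 62% cheaper (about 2.6x), in line with the 5-year rows of the scenario summaries.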

This is not meant as an attack on the original analysis - I am sure the synthetic bench I used has a few holes, and purchase price and residual value vary a fair bit. I also don't think anybody will run their MBP for inference for 5 years straight. But with worsening GPU supply and the inevitable price/access squeeze, I think local LLMs have a huge role to play. And this is on top of the privacy benefits. A misperceived price differential should not be what slows down adoption.


Replies

jmalicki, today at 5:19 AM

The tweet does not make clear what the power cost assumptions are. That is wildly variable and important! For some people local may be cheaper, perhaps not for others.
