> Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility into what the true cost of tokens is now, let alone what it will be in a few years' time. It could be so cheap that we don't care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.
We do have some idea. Kimi K2 is a relatively high-performing open-source model. People have it running at 24 tokens/s on a pair of Mac Studios, which cost about $20k. That setup draws less than a kW, so the $0.08-$0.15/hour spent on power is negligible compared to a developer's salary. This might be the cheapest setup for running it locally, but the cost per token is almost certainly far lower with specialized hardware at scale.
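To put numbers on that, here's a back-of-the-envelope sketch; the 5-year lifespan, $0.15/kWh rate, and 24/7 utilization are my assumptions, not measurements:

    # Rough cost per million output tokens for a local inference box,
    # assuming 24/7 utilization (a generous best case).
    def cost_per_mtok(hardware_usd, lifespan_years, power_kw, usd_per_kwh, tok_per_s):
        hours = lifespan_years * 365 * 24
        hourly_cost = hardware_usd / hours + power_kw * usd_per_kwh
        mtok_per_hour = tok_per_s * 3600 / 1e6
        return hourly_cost / mtok_per_hour

    # Pair of Mac Studios running Kimi K2, using the numbers above:
    print(cost_per_mtok(20_000, 5, 1.0, 0.15, 24))  # ~$7 per million tokens

Real utilization will be far below 24/7, which pushes the effective cost up proportionally, so the amortization assumptions matter more than the power ones.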
In other words, a near-frontier model is running at a cost that a (somewhat wealthy) hobbyist can afford. And it's hard to imagine that the hardware costs don't come down quite a bit. I don't doubt that tokens are heavily subsidized, but I think this might be overblown [1].
[1] Training models is still extraordinarily expensive, and that cost is certainly being subsidized, but you can amortize it over a lot of inference, especially once ideas plateau and training runs become less frequent.
You don't even need to spend that much. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2,500 these days. I get about 20 tokens/s on Qwen3 Coder Next at an 8-bit quant, or 17 tokens/s on Minimax M2.5 at a 3-bit quant.
Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4: 6-12 months behind SOTA, on hardware that's well within a personal hobby budget.
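Plugging these numbers into the same back-of-the-envelope math as the Mac Studio sketch above (the ~140 W draw under load and $0.15/kWh are guesses on my part):

    # Strix Halo box: $2,500 amortized over 5 years, ~140 W under load
    hourly_cost = 2_500 / (5 * 365 * 24) + 0.14 * 0.15  # ~$0.08/hour
    print(hourly_cost / (20 * 3600 / 1e6))              # ~$1.1 per million tokens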
$20k for such a setup, for a hobbyist? You can leave the "somewhat" away; that puts you in the sub-1% wealth region globally. A kW of power is still at least $2k/year for me. I don't expect it to run continuously, but that's still not negligible when you can get by on cheap subscriptions for $100-200 a year.
Reminder to others that the $20k is a one-time startup cost, amortized to perhaps $2-4k/year (plus power). Around me, that's in the realm of a mere family gym membership.
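A quick sanity check on both figures (the $0.23/kWh rate is my assumption for a relatively expensive grid):

    hardware = 20_000
    for years in (5, 10):
        print(f"${hardware / years:,.0f}/year amortized over {years} years")
    print(f"${1.0 * 8760 * 0.23:,.0f}/year for 1 kW running continuously")
    # -> $4,000 and $2,000/year for hardware, ~$2,015/year for power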
90% of companies would go bankrupt within a year if you replaced their engineering team with execs talking to K2...
> And it's hard to imagine that the hardware costs don't come down quite a bit.
Have you paid any attention to the hardware situation over the last year?
This week they've bought up the 2026 supply of disks.
> a cost that a (somewhat wealthy) hobbyist can afford
$20,000 is a lot to drop on a hobby. Probably less than 10%, maybe less than 5%, of all hobbyists could afford that.
Horrific comparison point. Single-user local LLM inference is far more expensive per token than batch inference at scale on datacenter GPUs/TPUs, because batching amortizes each pass over the weights across many concurrent requests.
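A toy roofline for dense-model decoding shows why (this ignores KV-cache traffic and compute limits, and all the numbers are illustrative assumptions):

    # Decoding must stream the weights through memory once per step, so a
    # single user is capped at bandwidth / model size; a batch of B users
    # shares that same weight read and gets ~B times the aggregate rate.
    weights_gib = 60          # e.g. a ~120B-param model at 4-bit
    bandwidth_gibs = 800      # Mac Studio-class unified memory
    batch = 32                # concurrent requests in a datacenter
    print(bandwidth_gibs / weights_gib)          # ~13 tok/s for one user
    print(batch * bandwidth_gibs / weights_gib)  # ~427 tok/s aggregate, batched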
If I remember correctly, Dario claimed that AI inference gross profit margins are 40-50%.
>24 tokens/second
That's marketing, not reality. Feed it more than a few lines of code and it becomes unusable.
> a near-frontier model
Is Kimi K2 near-frontier, though? At least when run in an agent harness, and for general coding questions, it seems pretty far from it. I know what the benchmarks say; they always say it's great and close to the frontier models. But is that others' impression in practice? Maybe my prompting style works best with GPT-type models, but I'm just not seeing it for the type of engineering work I do, which is fairly typical stuff.