Hacker News

simonw · yesterday at 6:01 PM

The cost per token served has been falling steadily over the past few years across basically all of the providers. In June last year OpenAI dropped the price it charged for o3 to 1/5th of what it had been, crediting "engineers optimizing inferencing", and plenty of other providers have found cost savings too.
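
For scale, here's a back-of-the-envelope sketch in Python of what that 1/5th cut means per request. The prices are o3's reported list prices before and after the cut (roughly $10/$40 per million input/output tokens, then $2/$8); treat the exact figures and the example token counts as illustrative assumptions:

    # Rough per-request cost at o3's reported list prices, before and
    # after the cut. Prices are USD per million tokens; both the prices
    # and the example token counts are assumptions for illustration.
    PRICES = {
        "o3 before": {"input": 10.00, "output": 40.00},
        "o3 after":  {"input": 2.00,  "output": 8.00},
    }

    def request_cost(p, input_tokens, output_tokens):
        """USD cost of one request at per-million-token prices p."""
        return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

    # Example request: 5k tokens of prompt/context, 1k tokens of response.
    for name, p in PRICES.items():
        print(f"{name}: ${request_cost(p, 5_000, 1_000):.4f}")
    # o3 before: $0.0900
    # o3 after:  $0.0180  (exactly 1/5th)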

Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

Where did you hear that? It doesn't match my mental model of how this has played out.


Replies

cootsnuck · yesterday at 6:11 PM

I have not seen any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

That does not mean the frontier labs are pricing their APIs to cover their costs yet.

It can be true both that inference has gotten cheaper for them to provide and that they are still subsidizing it.

In fact, I'd argue that's far more likely, given that this has been precisely the go-to strategy for highly competitive startups for a while now: price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, and burn through investor money until then.

What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.

replwoacause · today at 12:39 AM

My experience trying to use Opus 4.5 on the Pro plan has been terrible. It blows through my usage very, very fast. I avoid it altogether now. Yes, I know they warn about this, but it's comical how quickly it happens.

nubg · yesterday at 6:05 PM

> "engineers optimizing inferencing"

Are we sure this is not a fancy way of saying quantization?
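
(For anyone unfamiliar with the term: quantization stores model weights at lower precision, e.g. int8 instead of fp32/fp16, cutting memory and often boosting throughput at the cost of a small rounding error. A minimal NumPy sketch of symmetric per-tensor int8 quantization — purely illustrative, not any lab's actual scheme:)

    import numpy as np

    def quantize_int8(w: np.ndarray):
        """Symmetric per-tensor int8 quantization: w ~= scale * q."""
        scale = np.abs(w).max() / 127.0   # largest magnitude maps to 127
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, scale = quantize_int8(w)

    # 4 bytes per weight -> 1 byte per weight, for a small rounding error.
    print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB")
    print(f"max abs error: {np.abs(w - dequantize(q, scale)).max():.4f}")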

sumitkumar · yesterday at 6:49 PM

It seems true for Gemini, because they have a humongous sparse model, but it isn't so true for the max-performance Opus 4.5/6 and GPT-5.2/3.
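
(Why sparsity matters for serving cost: in a mixture-of-experts model, only a fraction of the parameters are active for any given token, so per-token compute scales with the active count rather than the total. A toy comparison in Python, with invented parameter counts, since no lab publishes the real ones:)

    # Toy per-token compute for a dense model vs. a sparse (MoE) model.
    # Rule of thumb: ~2 FLOPs per active parameter per generated token.
    # All parameter counts below are invented for illustration.
    def flops_per_token(active_params: float) -> float:
        return 2 * active_params

    dense_total = 500e9                    # dense: every parameter is active
    moe_total, moe_active = 1500e9, 80e9   # MoE: huge total, small active slice

    print(f"dense 500B:            {flops_per_token(dense_total):.1e} FLOPs/token")
    print(f"MoE 1.5T (80B active): {flops_per_token(moe_active):.1e} FLOPs/token")
    # The MoE model is 3x larger overall yet ~6x cheaper per token to serve.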