> The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high.
How do you know this?
> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
You can't even estimate this spread without at least a rough idea of cost-per-token, and right now any cost-per-token figure is pure paper math.
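To make that concrete, here's a minimal back-of-envelope sketch. Every input (GPU price, lifespan, utilization, throughput, power draw, electricity rate) is a number I made up, and the resulting "cost" swings by roughly an order of magnitude depending on which guesses you plug in:

```python
# Sketch of why cost-per-token is "paper math": every input below is an
# assumption, and the output is extremely sensitive to all of them.

def cost_per_million_tokens(gpu_price_usd, lifespan_years, utilization,
                            tokens_per_second, power_kw, usd_per_kwh):
    """Amortized hardware + energy cost per 1M tokens (hypothetical inputs)."""
    seconds_alive = lifespan_years * 365 * 24 * 3600
    hw_cost_per_sec = gpu_price_usd / seconds_alive
    energy_cost_per_sec = power_kw * usd_per_kwh / 3600
    effective_tokens_per_sec = tokens_per_second * utilization
    return (hw_cost_per_sec + energy_cost_per_sec) / effective_tokens_per_sec * 1e6

# Same GPU, two sets of guesses -> roughly a 10x difference in "cost":
optimistic = cost_per_million_tokens(30_000, 5, 0.7, 2_000, 1.0, 0.05)
pessimistic = cost_per_million_tokens(30_000, 2, 0.3, 1_000, 1.0, 0.10)
print(f"optimistic:  ${optimistic:.2f} / 1M tokens")   # ~ $0.15
print(f"pessimistic: ${pessimistic:.2f} / 1M tokens")  # ~ $1.68
```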
> And datacenter GPUs have been running inference workloads for years now,
And inference resource intensity is a moving target. What happens if a new model comes out that requires 2x the resources?
> They're not throwing away two-year-old chips.
Maybe, but they'll be retired once either (a) a higher-performance GPU can deliver the same results with less energy, less physical density, and less cooling, or (b) the extended support costs become financially untenable.
If a model costs them 2x as much, they charge 2x as much. That much is clear from their API pricing.
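A toy sketch of that pass-through logic (both starting numbers are made up, not actual provider figures): doubling the serving cost and the API price together leaves the percentage margin untouched.

```python
# Hypothetical pass-through pricing: if a new model costs 2x to serve and the
# provider charges 2x, the margin percentage is unchanged.
cost_per_mtok = 0.50   # assumed serving cost, $/1M tokens
price_per_mtok = 2.00  # assumed API price, $/1M tokens

for multiplier in (1, 2):  # e.g. a new model that is 2x as expensive to run
    cost = cost_per_mtok * multiplier
    price = price_per_mtok * multiplier
    margin = (price - cost) / price
    print(f"{multiplier}x model: cost=${cost:.2f}, price=${price:.2f}, margin={margin:.0%}")
```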