Hacker News

storystarling · today at 8:05 AM · 0 replies

That assumes inference efficiency is static, which isn't really the case. Between aggressive quantization, speculative decoding, and better batching strategies, the cost per token can vary wildly on the exact same hardware. I suspect the margins right now come from architecture choices as much as raw power costs.
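To make the point concrete, here is a minimal back-of-the-envelope sketch. All numbers are hypothetical (the GPU hourly price and the per-configuration throughputs are illustrative assumptions, not benchmarks), but they show how serving configuration alone can move cost per token by more than an order of magnitude on the same hardware:

```python
# Illustrative sketch with made-up numbers: cost per token depends heavily
# on serving configuration, not just raw hardware or power costs.

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per 1M tokens for one accelerator at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

GPU_HOURLY = 2.00  # hypothetical on-demand price for a single accelerator

# Hypothetical sustained throughputs on the *same* hardware:
scenarios = {
    "fp16, batch size 1":           50,    # tokens/s
    "fp16, continuous batching":    800,
    "int8 + speculative decoding":  2400,
}

for name, tps in scenarios.items():
    usd = cost_per_million_tokens(GPU_HOURLY, tps)
    print(f"{name:30s} ${usd:6.2f} per 1M tokens")
```

With these placeholder figures the spread is roughly $11 down to about $0.23 per million tokens, a ~50x difference with identical silicon and identical power draw, which is the comment's point about margins coming from architecture choices.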