Hacker News

storystarling · today at 8:05 AM · 0 replies

That assumes inference efficiency is static, which isn't really the case. Between aggressive quantization, speculative decoding, and better batching strategies, the cost per token can vary wildly on the exact same hardware. I suspect the margins right now come from architecture choices as much as raw power costs.
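To make the point concrete, here is a minimal back-of-the-envelope sketch. All numbers are hypothetical (the GPU hourly price and the per-configuration throughputs are illustrative assumptions, not benchmarks), but they show how serving configuration alone can move cost per token by more than an order of magnitude on the same hardware:

```python
# Illustrative sketch with made-up numbers: cost per token depends heavily
# on serving configuration, not just raw hardware or power costs.

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per 1M tokens for one accelerator at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

GPU_HOURLY = 2.00  # hypothetical on-demand price for a single accelerator

# Hypothetical sustained throughputs on the *same* hardware:
scenarios = {
    "fp16, batch size 1":           50,    # tokens/s
    "fp16, continuous batching":    800,
    "int8 + speculative decoding":  2400,
}

for name, tps in scenarios.items():
    usd = cost_per_million_tokens(GPU_HOURLY, tps)
    print(f"{name:30s} ${usd:6.2f} per 1M tokens")
```

With these placeholder figures the spread is roughly $11 down to about $0.23 per million tokens, a ~50x difference with identical silicon and identical power draw, which is the comment's point about margins coming from architecture choices.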