Hacker News

NicuCalcea · today at 2:17 PM

The blog compares the cost of running Gemma4 31b, which on OpenRouter is offered by small no-name inference providers, not by frontier AI companies. It seems like a fair comparison to me.


Replies

pornel · today at 4:35 PM

LLM generation is bottlenecked by RAM bandwidth and latency. You can get almost linear scaling by evaluating more prompts in parallel, because the GPU has nothing to do during the relative eternity it takes to read all of the weights from DRAM, layer by layer, for every token.
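A back-of-envelope sketch of that bottleneck, with illustrative (not measured) numbers: a ~31B-parameter model quantized to 8 bits and a GPU with ~900 GB/s of DRAM bandwidth, both assumptions of mine rather than figures from the thread.

```python
# Memory-bandwidth bound on single-stream decode speed.
# Assumed, illustrative numbers: ~31B params at 8-bit (~31 GB of weights),
# ~900 GB/s DRAM bandwidth.
params = 31e9
bytes_per_param = 1          # 8-bit quantization
bandwidth = 900e9            # bytes/s

weight_bytes = params * bytes_per_param
# Every generated token streams all the weights through the memory bus once,
# so bandwidth alone caps single-stream throughput:
max_tokens_per_sec = bandwidth / weight_bytes
print(f"{max_tokens_per_sec:.1f} tok/s upper bound")  # ~29 tok/s
```

Serving a second prompt in the same pass reuses that one weight read, which is why parallelism is nearly free until compute catches up.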

On Apple Silicon you can get 4x-8x more tokens per second if you run more queries in parallel (as long as your inference server supports it, and has enough spare RAM for more KV caches).
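The 4x-8x figure can be modeled with a toy cost function: per decode step the weight read is a fixed cost, while per-sequence compute grows with batch size, and the slower of the two dominates. The constants below are made up for illustration, not taken from any specific chip.

```python
# Toy model of why batching scales almost linearly while memory-bound.
# weight_read_s and compute_per_seq_s are illustrative assumptions.
def step_time(batch, weight_read_s=0.035, compute_per_seq_s=0.0005):
    # Weight streaming and compute overlap; the slower one dominates the step.
    return max(weight_read_s, batch * compute_per_seq_s)

for b in [1, 8, 64, 128]:
    tput = b / step_time(b)
    print(f"batch {b:3d}: {tput:7.0f} tok/s")
```

Throughput grows almost linearly with batch size until `batch * compute_per_seq_s` overtakes the fixed weight-read time, at which point the GPU becomes compute-bound and the curve flattens.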

When inference is done at datacenter scale, with generation distributed across multiple GPUs and kernels carefully tuned to specific hardware, the compute-to-DRAM-bandwidth ratio gets absurd, on the order of 200:1. That's why everyone offers batch inference at a steep discount.
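The ratio above can be sanity-checked from chip specs. The numbers below are my assumptions, roughly in the ballpark of a current datacenter GPU rather than official figures; the exact ratio depends on chip and precision, but it lands in the low hundreds, consistent with the 200:1 figure.

```python
# FLOPs vs bytes a datacenter-class GPU can move per second.
# Assumed, illustrative specs (not from a datasheet):
peak_flops = 1000e12      # ~1 PFLOP/s of dense FP16 matmul
dram_bandwidth = 3.35e12  # ~3.35 TB/s of HBM bandwidth

ratio = peak_flops / dram_bandwidth
print(f"~{ratio:.0f} FLOPs available per byte moved")
```

Single-stream decoding does only a couple of FLOPs per weight byte, so hundreds of potential FLOPs per byte sit idle unless many prompts share each weight read.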