Hacker News

Aurornis · yesterday at 10:21 PM

That's not how this works. LLM serving at scale batches multiple requests together for efficiency. Reduce the parallelism and individual requests complete faster, but the overall number of tokens processed per second drops.
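A toy model can make the trade-off concrete. The sketch below assumes (hypothetically) that one decode step costs a fixed overhead plus a small per-request cost, so batching amortizes the overhead; all numbers and function names are illustrative, not measurements from any real serving system.

```python
# Toy model of the batching trade-off in LLM serving.
# Assumption (illustrative): one decode step costs a fixed overhead
# (weight loads, kernel launches) plus a small per-request increment.

def step_time_ms(batch_size: int, fixed_ms: float = 20.0, per_req_ms: float = 1.0) -> float:
    """Latency of one decode step for a given batch size."""
    return fixed_ms + per_req_ms * batch_size

def per_request_latency_ms(batch_size: int) -> float:
    """Each request in the batch waits for the whole step to finish."""
    return step_time_ms(batch_size)

def throughput_tok_per_s(batch_size: int) -> float:
    """Tokens generated per second across all requests in the batch."""
    return batch_size * 1000.0 / step_time_ms(batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  latency/step={per_request_latency_ms(b):6.1f} ms  "
          f"throughput={throughput_tok_per_s(b):7.1f} tok/s")
```

Under these assumptions a batch of 1 gets the lowest per-token latency but the worst total throughput, while a batch of 64 multiplies throughput at the cost of slower individual requests, which is the point of the comment above.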


Replies

falloutx · yesterday at 10:27 PM

They can now easily slow down the normal mode, and then users will have to pay more for the fast mode.
