Hacker News

Aurornis · yesterday at 10:21 PM

That's not how this works. LLM serving at scale batches multiple requests together for efficiency. Reduce the parallelism and individual requests complete faster, but the overall number of tokens processed per second drops.
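A toy model can make the trade-off concrete. The sketch below assumes (hypothetically) that one decode step costs a fixed overhead plus a small per-request cost, so batching amortizes the overhead; all numbers and function names are illustrative, not measurements from any real serving system.

```python
# Toy model of the batching trade-off in LLM serving.
# Assumption (illustrative): one decode step costs a fixed overhead
# (weight loads, kernel launches) plus a small per-request increment.

def step_time_ms(batch_size: int, fixed_ms: float = 20.0, per_req_ms: float = 1.0) -> float:
    """Latency of one decode step for a given batch size."""
    return fixed_ms + per_req_ms * batch_size

def per_request_latency_ms(batch_size: int) -> float:
    """Each request in the batch waits for the whole step to finish."""
    return step_time_ms(batch_size)

def throughput_tok_per_s(batch_size: int) -> float:
    """Tokens generated per second across all requests in the batch."""
    return batch_size * 1000.0 / step_time_ms(batch_size)

for b in (1, 8, 64):
    print(f"batch={b:3d}  latency/step={per_request_latency_ms(b):6.1f} ms  "
          f"throughput={throughput_tok_per_s(b):7.1f} tok/s")
```

Under these assumptions a batch of 1 gets the lowest per-token latency but the worst total throughput, while a batch of 64 multiplies throughput at the cost of slower individual requests, which is the point of the comment above.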


Replies

falloutx · yesterday at 10:27 PM

They can now easily slow down the normal mode, and then users will have to pay more for the fast mode.
