Hacker News

falloutx, yesterday at 10:09 PM

That's also called slowing down the default experience so users have to pay more for the fast mode. I think it's the first time we are seeing blatant speed ransoms in LLMs.


Replies

Aurornis, yesterday at 10:21 PM

That's not how this works. LLM serving at scale processes multiple requests in parallel for efficiency. Reduce the parallelism and you can process individual requests faster, but the overall number of tokens processed is lower.
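
To make the tradeoff concrete, here is a rough toy model in Python. It is only a sketch: the cost formula and the constants `fixed_ms` and `per_request_ms` are made-up assumptions for illustration, not measurements from any real serving stack. It assumes each decode step emits one token per request in the batch and costs a fixed overhead plus a small amount per request.

```python
# Toy model of batched LLM decoding. Every constant here is an
# illustrative assumption, not a measurement of any real system.

def decode_step_ms(batch_size, fixed_ms=20.0, per_request_ms=0.5):
    """Time for one decode step that emits one token per request in the batch."""
    return fixed_ms + per_request_ms * batch_size


def per_request_tokens_per_s(batch_size):
    """Generation speed seen by a single user at this batch size."""
    return 1000.0 / decode_step_ms(batch_size)


def total_tokens_per_s(batch_size):
    """Aggregate tokens per second across every request in the batch."""
    return batch_size * per_request_tokens_per_s(batch_size)


if __name__ == "__main__":
    for batch_size in (1, 8, 64):
        print(
            f"batch={batch_size:3d}  "
            f"per-request={per_request_tokens_per_s(batch_size):7.1f} tok/s  "
            f"total={total_tokens_per_s(batch_size):8.1f} tok/s"
        )
```

Under these assumptions a smaller batch gives each request more tokens per second, while the hardware produces fewer tokens in total, so a faster tier costs the provider more per token served.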

throw310822, yesterday at 10:12 PM

Slowing down with respect to what?
