nyrikki • yesterday at 9:30 PM • 1 reply

We still have the problem that autoregressive decoders are memory-bound.

The new Blackwell hardware combined with TensorRT-LLM and speculative decoding can consistently hit the 1,000 TPS/user barrier, compared to roughly 250 TPS/user (out of 10k+ TPS on the server).

Is there something I missed? This looks more like a 14.4k-to-56k modem story on a 64 kbps backing channel. I have no doubt that there are still massive gains to be found, but they seem to be using existing constraints more efficiently, not that FiOS is coming.

I don’t have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60-80% acceptance rate I can see how they could promise 750 TPS (with a lot of other hard work). I would appreciate pointers on where to look to figure out what I am missing.
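
For anyone who wants to check that back-of-envelope estimate, here is a rough sketch in Python. It uses the common simplified model of speculative decoding, which assumes each drafted token is accepted independently with the same probability and ignores batching, KV-cache traffic, and kernel overheads, so treat it as an illustration and not a benchmark. The function names and the 250 TPS baseline (taken from the figure above) are my own assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass.

    alpha: assumed per-token acceptance probability (independent per token).
    gamma: number of tokens the draft model proposes per step.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    draft_cost: draft pass time divided by target pass time,
    e.g. 0.1 for a draft model that is 10x faster (an assumption).
    """
    # One step costs gamma draft passes plus one target verification pass.
    step_cost = gamma * draft_cost + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


def best_tps(baseline_tps: float, alpha: float, draft_cost: float,
             max_gamma: int = 16) -> tuple[int, float]:
    """Pick the draft length that maximizes estimated tokens per second."""
    best_gamma = max(
        range(1, max_gamma + 1),
        key=lambda g: speculative_speedup(alpha, g, draft_cost),
    )
    return best_gamma, baseline_tps * speculative_speedup(alpha, best_gamma, draft_cost)


if __name__ == "__main__":
    baseline = 250.0  # TPS/user without speculation, per the comment above
    for alpha in (0.6, 0.8):
        for ratio in (10, 20):
            gamma, tps = best_tps(baseline, alpha, 1.0 / ratio)
            print(f"accept={alpha:.0%} draft={ratio}x gamma={gamma} -> ~{tps:.0f} TPS")
```

Under these assumptions, an 80% acceptance rate with a 20x faster draft model lands in the neighborhood of 750 TPS from a 250 TPS baseline, and the lower acceptance rates land well below that. Either way the model only amortizes the memory-bound target pass over more tokens; it does not remove it.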

Replies

rsalus • yesterday at 10:11 PM

agree, from my POV the constraints are still there but we've optimized now. still haven't solved the core problems.
