Hacker News

bob1029 yesterday at 8:43 PM

At a certain rate we will be able to move towards continuous / real-time inference systems. The discrete, turn-based solutions are quite confining in how they must be trained. Continuous, real-time inference would fundamentally alter the domain.

From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dial-up connection. Imagine 10 million tokens per second.
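A rough back-of-the-envelope check of the dial-up comparison, as a sketch only. It assumes a vocabulary of about 100,000 tokens (a hypothetical figure, not from the comment) and treats each token as carrying at most log2(vocabulary size) bits, which is an upper bound since real token streams are far from uniformly distributed.

```python
import math


def token_stream_bits_per_second(tokens_per_second: float, vocab_size: int) -> float:
    """Upper bound on the raw bit rate of a token stream.

    A token drawn from a vocabulary of `vocab_size` entries carries at most
    log2(vocab_size) bits; actual text carries less than this bound.
    """
    if tokens_per_second < 0:
        raise ValueError("tokens_per_second must be non-negative")
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return tokens_per_second * math.log2(vocab_size)


# Assumption for illustration only: a vocabulary of roughly 100k tokens.
ASSUMED_VOCAB_SIZE = 100_000

for tps in (750, 10_000_000):
    kbps = token_stream_bits_per_second(tps, ASSUMED_VOCAB_SIZE) / 1000
    print(f"{tps:>10,} tokens/s -> at most {kbps:,.1f} kbit/s")
```

Under these assumptions, 750 tokens per second works out to roughly 12 kbit/s, which is below a 14.4k modem, while 10 million tokens per second lands in the range of a fast broadband link.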


Replies

nyrikki yesterday at 9:30 PM

We still have the problem that autoregressive decoders are memory-bound.

The new Blackwell hardware combined with TensorRT-LLM and speculative decoding can consistently hit the 1,000 TPS/user barrier, compared to closer to ~250 TPS/user (out of 10k+ TPS on the server).

Is there something I missed? This looks more like a 14.4-to-56k modem story on a 64 kbps backing channel. I have no doubt that there are still massive gains to be found, but they seem to be using existing constraints more efficiently, not that FiOS is coming.

I don't have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60-80% acceptance rate I can see how they could promise 750 TPS (with a lot of other hard work). I would appreciate pointers on where to look to figure out what I am missing.
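The draft-model arithmetic above can be sketched with the commonly cited idealized analysis of speculative decoding. This is a simplified model, not a description of any vendor's system: it assumes each drafted token is accepted independently with probability `alpha`, that the draft model costs a fixed fraction `cost_ratio` of a target-model step (0.1 to 0.05 for a draft that is 10x to 20x faster), and that `gamma` tokens are drafted per verification step. The value of `gamma` and the 250 TPS baseline used below are illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step.

    alpha: per-token acceptance probability (assumed i.i.d.), in [0, 1].
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def speculative_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    cost_ratio: cost of one draft step relative to one target step.
    Each iteration costs gamma draft steps plus one target step.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)


# Illustrative assumptions: 250 TPS/user baseline, 5 drafted tokens per step.
BASELINE_TPS = 250
GAMMA = 5

for alpha in (0.6, 0.7, 0.8):
    for cost_ratio in (0.1, 0.05):  # draft model 10x and 20x faster
        speedup = speculative_speedup(alpha, GAMMA, cost_ratio)
        print(f"alpha={alpha:.1f} cost_ratio={cost_ratio:.2f} "
              f"speedup={speedup:.2f}x -> ~{BASELINE_TPS * speedup:.0f} TPS")
```

In this model the speedup stays in the low single digits across the quoted range of acceptance rates, which is consistent with the view that these are gains from using existing constraints more efficiently rather than a change of regime.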

mikepurvis yesterday at 9:11 PM

Is there anyone exploring or writing about this in public? I've felt for a while that the turn-based model was not quite right, but also felt too stupid and ill-informed to have much of an opinion about what else it could be.

dennisy yesterday at 8:54 PM

That would be interesting.

Do you feel most of the speed upgrade will come from the software or hardware side?

dyauspitr yesterday at 9:30 PM

And more importantly, those 10 million tokens/s should cost fractions of a penny. Tokens need to be dirt cheap, so I hope they build out massive solar+battery powered data centers ASAP.

b112 yesterday at 9:23 PM

Your comment made me think of another kind of real time: real-time, dynamic code/APIs.

Imagine a world where there is no code, just things mildly handshaking and then creating data APIs on the fly. Where communication is fuzzy and locked in on an individual basis. No years of RFCs, no RFCs at all, just... data.

Just data, man.

An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance.

ai_fry_ur_brain yesterday at 9:13 PM

Ahh yes, slop at the speed of light, how useful!
