Hacker News

randomtoast · yesterday at 10:57 PM

0.2 tok/s is fine for experimentation, but it is not interactive in any meaningful sense. For many use cases, a well-quantized 8B or 13B that stays resident will simply deliver a better latency-quality tradeoff.
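For reference, a minimal sketch of the "stays resident" pattern the comment describes, assuming llama-cpp-python and a quantized GGUF file on disk (the path and parameters below are placeholders, not from the thread):

    from llama_cpp import Llama

    # Load once and keep the model in memory between requests; each call then
    # pays only prompt processing and decoding, not a reload from disk.
    llm = Llama(
        model_path="./llama-8b-instruct.Q4_K_M.gguf",  # hypothetical file
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload every layer that fits onto the GPU
    )

    out = llm("Explain why per-token latency matters for interactive use:",
              max_tokens=128)
    print(out["choices"][0]["text"])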


Replies

Wuzado · yesterday at 11:57 PM

I can imagine a couple of scenarios in which a high-quality, large model would be much preferred over lower-latency ones, primarily when you actually need the quality.

xaskasdf · today at 1:11 AM

Yeah, actually I wanted to see if this was possible at all. I managed to get around 3000 tokens/s on a PS2 with classic transformers; the Emotion Engine can handle 32-bit addresses, but the machine only has about 32 MB of RAM. That raised the question of why it was so fast when I couldn't get that speed on a regular computer even with small models, and the answer is that the instructions went straight from memory to the GPU. That's the main difference from inference on a regular computer, which has to go through the CPU for every fetch. As I mentioned, professional cards avoid these problems naturally, since they have instructions precisely for this, but sadly I don't have 30k bucks to spare on a GPU :(
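The data-path point is easy to see on a PC as well: how much the CPU has to mediate a host-to-GPU copy shows up directly in transfer bandwidth. A rough sketch, assuming a CUDA build of PyTorch (buffer size and loop count are arbitrary), comparing pageable versus pinned host memory:

    import time
    import torch

    assert torch.cuda.is_available()
    n_bytes = 1 << 28  # 256 MiB test buffer
    pageable = torch.empty(n_bytes, dtype=torch.uint8)
    pinned = torch.empty(n_bytes, dtype=torch.uint8).pin_memory()
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

    def bench(src, label):
        # Time repeated host-to-device copies and report effective bandwidth.
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(10):
            dst.copy_(src, non_blocking=True)
        torch.cuda.synchronize()
        gbps = 10 * n_bytes / (time.perf_counter() - t0) / 1e9
        print(f"{label}: {gbps:.1f} GB/s")

    bench(pageable, "pageable host memory")  # staged through the CPU
    bench(pinned, "pinned host memory")      # direct DMA to the GPU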

fluoridation · today at 3:56 AM

That's slower than just running it off CPU+GPU. I can easily hit 1.5 tokens/s on a 7950X+3090 with a 20480-token context.
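The usual way to get that kind of CPU+GPU split is partial layer offload. A sketch with llama-cpp-python under assumed settings (model path, layer count, and thread count are illustrative, not the commenter's actual setup):

    from llama_cpp import Llama

    llm = Llama(
        model_path="./large-model.Q4_K_M.gguf",  # hypothetical file
        n_ctx=20480,       # long context, as in the comment
        n_gpu_layers=40,   # as many layers as fit in the GPU's VRAM
        n_threads=16,      # CPU threads for the layers left in system RAM
    )

    print(llm("Hello", max_tokens=16)["choices"][0]["text"])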

tyfon · yesterday at 11:30 PM

I didn't really understand the performance table until I saw the top ones were 8B models.

But 5 seconds/token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB of RAM can run this faster on the CPU, with some layers and the prefill on the 3060 GPU I have.

I also see that they claim the process is compute-bound at 2 seconds/token, but that doesn't seem right for a 3090?
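A back-of-envelope check of that claim, with rough assumed figures rather than measured ones: for single-stream decoding with weights resident in VRAM, every generated token reads all active weights once, so the memory-bandwidth ceiling is normally hit long before the compute ceiling.

    # All numbers below are rough assumptions for illustration only.
    params = 8e9               # hypothetical 8B-parameter model
    bytes_per_weight = 0.5     # ~4-bit quantization
    model_bytes = params * bytes_per_weight

    bw = 936e9                 # RTX 3090 memory bandwidth, ~936 GB/s
    flops = 35e12              # rough dense FP16 figure (tensor cores are higher)

    mem_bound_tps = bw / model_bytes          # ~230 tok/s
    compute_bound_tps = flops / (2 * params)  # ~2200 tok/s
    print(f"memory-bandwidth ceiling: {mem_bound_tps:.0f} tok/s")
    print(f"compute ceiling:          {compute_bound_tps:.0f} tok/s")

Both ceilings are orders of magnitude above 0.5 tok/s, so at 2 seconds/token the bottleneck would presumably be streaming weights in from disk or system RAM rather than compute, which fits the skepticism here.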
