Hacker News

p1esk yesterday at 8:04 PM

By perf I mean: how much does it cost to serve a 1T model to 1M users at 50 tokens/sec?


Replies

zozbot234 yesterday at 8:47 PM

Not all 1T models are equal. E.g., how many active parameters does it have? What's the native quantization? How long is the max context? Also, it's quite likely that some of the smaller models in common use are in fact sub-1T. If your model is light enough, the lower throughput doesn't necessarily hurt you all that much and you can enjoy the lightning-fast speed.
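To make those questions concrete, here is a rough back-of-envelope sketch in Python. It only does the arithmetic the thread implies: aggregate token demand from the 1M users at 50 tokens/sec, and how many bytes of weights are stored versus read per token. The function names, the 32B active-parameter count, and the bit widths are illustrative assumptions, not figures for any real model, and the sketch ignores context length, KV cache, and batching entirely.

```python
def aggregate_tokens_per_sec(users: int, tokens_per_sec_per_user: int) -> int:
    """Total decode throughput needed if every user streams at once."""
    return users * tokens_per_sec_per_user


def weight_bytes(params: float, bits_per_param: float) -> float:
    """Bytes needed to hold `params` weights at the given quantization."""
    return params * bits_per_param / 8


# Figures from the question upthread.
USERS = 1_000_000
TOKENS_PER_SEC_PER_USER = 50
TOTAL_PARAMS = 1e12  # "1T model"

# Hypothetical values, chosen only to show how much the answer moves.
ACTIVE_PARAMS = 32e9      # assumed active parameters per token (MoE-style)
BITS_CHOICES = (16, 8, 4)  # assumed candidate quantizations

demand = aggregate_tokens_per_sec(USERS, TOKENS_PER_SEC_PER_USER)
print(f"aggregate demand: {demand:,} tokens/sec")

for bits in BITS_CHOICES:
    stored = weight_bytes(TOTAL_PARAMS, bits)
    touched = weight_bytes(ACTIVE_PARAMS, bits)
    print(
        f"{bits:>2}-bit: {stored / 1e12:.2f} TB of weights stored, "
        f"{touched / 1e9:.0f} GB of weights read per token (unbatched)"
    )
```

The demand side is fixed at 50 million tokens/sec, but the stored and per-token footprints change by several times depending on quantization and active-parameter count, which is why "1T" alone doesn't pin down a cost.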
