By perf I mean: how much does it cost to serve a 1T model to 1M users at 50 tokens/sec?

p1esk • yesterday at 8:04 PM

Replies

Not all 1T models are equal. For example: how many active parameters does it have? What is the native quantization? How long is the max context? Also, it's quite likely that some models in common use are smaller, even sub-1T. If your model is light enough, the lower throughput doesn't necessarily hurt you all that much, and you can enjoy the lightning-fast speed.
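
To make those three knobs concrete, here is a rough back-of-envelope sketch, not a real cost model. It assumes decode is purely memory-bandwidth bound: each step reads the weights once plus every sequence's KV cache. It ignores compute limits, memory capacity, interconnect, and prefill. All function and parameter names are hypothetical, and `params_read` is a simplification: for a MoE model it sits somewhere between the active and the total parameter count, depending on batch size.

```python
import math


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Per-sequence KV cache size: keys and values (factor 2) for every
    layer and every cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def decode_step_seconds(params_read, bits_per_weight, kv_bytes_per_seq,
                        batch_size, bandwidth_bytes_per_s):
    """Time for one decode step if memory traffic is the only bottleneck."""
    weight_bytes = params_read * bits_per_weight / 8
    return (weight_bytes + batch_size * kv_bytes_per_seq) / bandwidth_bytes_per_s


def max_batch_at_speed(params_read, bits_per_weight, kv_bytes_per_seq,
                       bandwidth_bytes_per_s, tokens_per_s):
    """Largest batch for which every user still gets tokens_per_s."""
    if kv_bytes_per_seq <= 0:
        raise ValueError("kv_bytes_per_seq must be positive")
    byte_budget = bandwidth_bytes_per_s / tokens_per_s  # bytes we may read per step
    weight_bytes = params_read * bits_per_weight / 8
    if byte_budget < weight_bytes:
        return 0  # the weights alone are too heavy for this speed
    return int((byte_budget - weight_bytes) // kv_bytes_per_seq)


def replicas_needed(concurrent_users, batch_per_replica):
    """Number of model replicas, assuming every user decodes at the same time."""
    if batch_per_replica <= 0:
        raise ValueError("one replica cannot reach the target speed")
    return math.ceil(concurrent_users / batch_per_replica)


if __name__ == "__main__":
    # Every number below is a made-up placeholder, not a measurement of any
    # real model or accelerator.
    kv = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                        context_len=8192, bytes_per_elem=2)
    batch = max_batch_at_speed(params_read=32e9, bits_per_weight=8,
                               kv_bytes_per_seq=kv,
                               bandwidth_bytes_per_s=24e12, tokens_per_s=50)
    print("batch per replica:", batch)
    print("replicas for 1M concurrent users:", replicas_needed(1_000_000, batch))
```

Multiplying the replica count by whatever a replica costs per hour gives a cost figure. Halving the bits per weight, the active parameters, or the context length each change the answer, which is why "1T" alone doesn't pin it down.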
