Performance of LLM inference consists of two independent metrics - prompt processing (compute intens...

BoredomIsFun • today at 12:12 PM

Performance of LLM inference consists of two independent metrics: prompt processing (compute-intensive) and token generation (bandwidth-intensive). For autocomplete with a 1.5B model you can get away with an abysmal 10 t/s of token generation, but you'd want prompt processing to be as fast as possible, which the Pi is incapable of.
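
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The function name and every figure in it (2000-token context, 20-token completion, 100 t/s prompt processing) are made up for illustration and are not measurements of any real board; only the 10 t/s generation rate comes from the comment above.

```python
def completion_latency(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Return (prefill_seconds, decode_seconds) for one request.

    pp_tps: prompt processing speed in tokens/s (compute-bound phase)
    tg_tps: token generation speed in tokens/s (bandwidth-bound phase)
    """
    prefill = prompt_tokens / pp_tps
    decode = output_tokens / tg_tps
    return prefill, decode


# Hypothetical autocomplete request: long context in, short completion out.
prefill, decode = completion_latency(2000, 20, pp_tps=100, tg_tps=10)
print(f"prefill {prefill:.1f}s, decode {decode:.1f}s")
# prints "prefill 20.0s, decode 2.0s"
```

With these assumed numbers the wait is dominated by the prompt, not by generation, which is why slow prompt processing hurts autocomplete far more than a low t/s figure does.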
