I'm learning about inference by running vLLM on a k8s cluster (EKS) and building a gateway to hold a <2s TTFT (time to first token) SLO.
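As an aside, here's roughly how I sanity-check that SLO: a minimal TTFT probe against vLLM's OpenAI-compatible streaming endpoint. This is a sketch, not my actual gateway code; the URL, model name, and prompt are placeholders, and I'm treating the first SSE data chunk as a proxy for the first token.

```python
import time
import httpx

# Placeholder endpoint; in practice point this at the gateway, not a
# single engine, so queueing delay is included in the measurement.
BASE_URL = "http://localhost:8000/v1/chat/completions"

def measure_ttft(prompt: str) -> float:
    """Seconds from sending the request to the first streamed chunk."""
    payload = {
        "model": "my-model",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    start = time.monotonic()
    with httpx.stream("POST", BASE_URL, json=payload, timeout=60.0) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # vLLM streams OpenAI-style SSE lines: "data: {...}".
            if line.startswith("data: ") and line != "data: [DONE]":
                return time.monotonic() - start
    raise RuntimeError("stream ended before any tokens arrived")

print(f"TTFT: {measure_ttft('Hello!'):.2f}s")
```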
Most recent aha moment: I kept wondering whether it was normal that each vLLM engine in my cluster was only processing 4 requests per second (that just seemed really low to me).
I realized a better metric is in-flight requests... each engine is actually handling ~70 requests at any given time, streaming tokens for over 30s per request.
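The mental model that reconciles these numbers is Little's Law: in-flight requests ≈ arrival rate × average request duration. A back-of-envelope sketch using my rough figures (the duration values are illustrative, not measurements):

```python
# Little's Law: L = lambda * W
# L = average concurrent (in-flight) requests per engine
# lambda = request arrival rate (req/s), W = average time in system (s)

def in_flight_requests(arrival_rate_rps: float, avg_duration_s: float) -> float:
    """Average number of concurrent requests implied by Little's Law."""
    return arrival_rate_rps * avg_duration_s

if __name__ == "__main__":
    # ~4 req/s per engine, streams lasting tens of seconds
    for duration_s in (17.5, 30.0):
        concurrency = in_flight_requests(4.0, duration_s)
        print(f"{duration_s:>5.1f}s avg duration -> ~{concurrency:.0f} in flight")
```

Run the arithmetic backwards and 70 in-flight requests at 4 req/s implies an average residence of ~17.5s per request, so a "low" 4 req/s is entirely consistent with an engine that looks very busy.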
Deeper dives into those metrics uncover interesting limitations that don't seem to be documented anywhere. On the other hand, it is through those reverse shibboleths that I can now tell my boss's boss has no idea what he is talking about, LLM-wise.