Author here. We were hitting tail latency and low GPU utilization issues serving SLMs via Triton.
I built a scrappy client-side router using Redis and Lua to track real-time GPU load. It boosted utilization by ~40% and brought tail latencies down.
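Roughly, the core of it looks like the sketch below (simplified; the sorted-set key, helper names, and error handling here aren't the exact implementation): each client runs a small Lua script that atomically picks the GPU endpoint with the fewest in-flight requests and bumps its counter, then decrements on completion.

```python
import redis

# Minimal sketch, assuming a sorted set keyed by GPU endpoint with
# in-flight request counts as scores. The key name "gpu:inflight" and
# the helper names are illustrative, not the actual code.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

PICK_LEAST_LOADED = r.register_script("""
-- Atomically pick the endpoint with the fewest in-flight requests
-- and bump its count, so concurrent clients can't all pick the same GPU.
local least = redis.call('ZRANGE', KEYS[1], 0, 0)[1]
if not least then return nil end
redis.call('ZINCRBY', KEYS[1], 1, least)
return least
""")

def acquire_endpoint():
    """Reserve the least-loaded GPU endpoint (e.g. 'gpu-0:8001')."""
    return PICK_LEAST_LOADED(keys=["gpu:inflight"])

def release_endpoint(endpoint):
    """Decrement the in-flight count once the request completes."""
    r.zincrby("gpu:inflight", -1, endpoint)

# Usage: wrap each Triton call so the counter stays accurate even on errors.
# endpoint = acquire_endpoint()
# try:
#     ... send request to Triton at `endpoint` ...
# finally:
#     release_endpoint(endpoint)
```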
Happy to hear feedback on the implementation or thoughts on better ways to do this!
Have you tried switching it to a job queue where the GPU instances pull work to keep themselves busy? That way you can autoscale the GPUs based on utilization, and I find it easier to tune and easier to monitor latency and backlogs. It does require some async mechanics on the client side, but I've found it easier to maintain. Rough sketch of the pull model below.
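A minimal sketch of what I mean, assuming a Redis list as the job queue and a hash for results (queue/key names and `run_inference` are placeholders, not your setup): each GPU instance blocks on the queue and pulls the next job as soon as it's free, and the queue length is what the autoscaler watches.

```python
import json
import redis

# Sketch of the pull model: workers keep themselves busy, clients enqueue
# and collect results asynchronously. All names here are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

QUEUE = "inference:jobs"
RESULTS = "inference:results"

def worker_loop():
    """Each GPU instance runs this loop, pulling work as fast as it can serve it."""
    while True:
        # Blocking pop means no client-side routing is needed; the backlog
        # length (LLEN on QUEUE) is the signal an autoscaler would use.
        _, raw = r.blpop(QUEUE)
        job = json.loads(raw)
        result = run_inference(job["input"])  # placeholder for the Triton call
        r.hset(RESULTS, job["id"], json.dumps(result))

def submit(job_id, payload):
    """Client side: enqueue the job, then poll or subscribe for the result."""
    r.rpush(QUEUE, json.dumps({"id": job_id, "input": payload}))
```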