Author here. We were hitting tail latency and low GPU utilization issues serving SLMs via Triton.
I built a scrappy client-side router using Redis and Lua to track real-time GPU load. It boosted utilization by ~40% and brought tail latencies down.
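Roughly, the core of it looks like the sketch below (simplified; the sorted-set key, helper names, and error handling here aren't the exact implementation): each client runs a small Lua script that atomically picks the GPU endpoint with the fewest in-flight requests and bumps its counter, then decrements on completion.

```python
import redis

# Minimal sketch, assuming a sorted set keyed by GPU endpoint with
# in-flight request counts as scores. The key name "gpu:inflight" and
# the helper names are illustrative, not the actual code.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

PICK_LEAST_LOADED = r.register_script("""
-- Atomically pick the endpoint with the fewest in-flight requests
-- and bump its count, so concurrent clients can't all pick the same GPU.
local least = redis.call('ZRANGE', KEYS[1], 0, 0)[1]
if not least then return nil end
redis.call('ZINCRBY', KEYS[1], 1, least)
return least
""")

def acquire_endpoint():
    """Reserve the least-loaded GPU endpoint (e.g. 'gpu-0:8001')."""
    return PICK_LEAST_LOADED(keys=["gpu:inflight"])

def release_endpoint(endpoint):
    """Decrement the in-flight count once the request completes."""
    r.zincrby("gpu:inflight", -1, endpoint)

# Usage: wrap each Triton call so the counter stays accurate even on errors.
# endpoint = acquire_endpoint()
# try:
#     ... send request to Triton at `endpoint` ...
# finally:
#     release_endpoint(endpoint)
```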
Happy to hear feedback on the implementation or thoughts on better ways to do this!
Have you tried switching it to a job queue where the GPU instances pull work to keep themselves busy? That way you can autoscale the GPUs based on utilization, and I find it easier to tune and easier to monitor latency and backlogs. It does require some async mechanics on the client side, but I've found it easier to maintain. Rough sketch of the pull model below.
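A minimal sketch of what I mean, assuming a Redis list as the job queue and a hash for results (queue/key names and `run_inference` are placeholders, not your setup): each GPU instance blocks on the queue and pulls the next job as soon as it's free, and the queue length is what the autoscaler watches.

```python
import json
import redis

# Sketch of the pull model: workers keep themselves busy, clients enqueue
# and collect results asynchronously. All names here are illustrative.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

QUEUE = "inference:jobs"
RESULTS = "inference:results"

def worker_loop():
    """Each GPU instance runs this loop, pulling work as fast as it can serve it."""
    while True:
        # Blocking pop means no client-side routing is needed; the backlog
        # length (LLEN on QUEUE) is the signal an autoscaler would use.
        _, raw = r.blpop(QUEUE)
        job = json.loads(raw)
        result = run_inference(job["input"])  # placeholder for the Triton call
        r.hset(RESULTS, job["id"], json.dumps(result))

def submit(job_id, payload):
    """Client side: enqueue the job, then poll or subscribe for the result."""
    r.rpush(QUEUE, json.dumps({"id": job_id, "input": payload}))
```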