Author here. We were hitting high tail latency and low GPU utilization serving SLMs via Triton.
I built a scrappy client-side router using Redis and Lua to track real-time GPU load. It boosted utilization by ~40% and improved latencies.
Happy to hear feedback on the implementation or thoughts on better ways to do this!
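For anyone following along, here is a minimal sketch of the least-loaded routing idea described above. It is not the actual implementation: the in-memory dict stands in for the Redis load table, the lock stands in for the atomicity a Lua script would provide, and the class name, method names, and instance names are all hypothetical.

```python
import threading


class LeastLoadedRouter:
    """Hypothetical in-memory stand-in for a Redis-backed GPU load table."""

    def __init__(self, instances):
        # In-flight request count per instance (assumed load metric).
        self._inflight = {name: 0 for name in instances}
        # The lock plays the role of an atomic Lua script in Redis:
        # pick-and-increment must happen as one step.
        self._lock = threading.Lock()

    def acquire(self):
        """Pick the instance with the fewest in-flight requests and reserve a slot."""
        with self._lock:
            name = min(self._inflight, key=self._inflight.get)
            self._inflight[name] += 1
            return name

    def release(self, name):
        """Release a slot once the request has completed."""
        with self._lock:
            self._inflight[name] = max(0, self._inflight[name] - 1)

    def load(self):
        """Return a snapshot of the current load table."""
        with self._lock:
            return dict(self._inflight)
```

A client would call `acquire()` before sending a request and `release()` when the response arrives. Note that `load()` exposes the whole table to whoever holds the client, which is the property the reply below is concerned about.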
If I understand the article correctly, any sufficiently capable attacker can:
- Know the global state of your GPU cluster via the client.
- Target the most heavily loaded GPU instances specifically, since the client decides which one to hit.
You offer a free tier, which means anyone can get an account and try this (e.g. one "harmless, mostly inactive" free account whose only purpose is to retrieve GPU cluster status, plus a bunch of burner accounts to overload the struggling instances).
I may be completely wrong, but this sounds like a DDoS served on a silver platter to me.