This is very interesting and yet not at the same time. This looks to be optimized for single-stream LLM traffic, which is not viable to serve in a production setting. It's only interesting to hobbyists who want to run the model locally.
It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.
Hi, yes, it’s not optimized for single stream; it’s optimized for total node throughput.
You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
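For anyone checking the math, here is a rough sketch of the arithmetic behind those numbers. It only uses the two figures quoted above, and it assumes throughput scales linearly with the number of concurrent streams, which real servers only approximate. The variable names are made up for illustration.

```python
# Approximate figures quoted in the comment above (units as in the thread).
per_stream_rate = 200    # ~200 on a single stream
aggregate_rate = 2600    # ~2,600 total across the node

# Assumption: each stream keeps its single-stream rate under concurrency,
# so aggregate = per_stream_rate * number_of_streams.
implied_streams = aggregate_rate / per_stream_rate

print(f"implied concurrent streams: {implied_streams:.0f}")
```

If per-stream rate drops under load, as it usually does, the real stream count needed to reach the aggregate figure would be higher than this estimate.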