beffjezos • today at 3:42 AM • 2 replies

This is very interesting and yet, at the same time, not. It looks to be optimized for single-stream LLM traffic, which is not viable to serve in a production setting. It's only interesting to hobbyists who want to run the model locally.

It's genuinely neat that AI can find the right optimization pathways in an AMD inference server to unlock this, but by the same token (pun intended) this is a classic case of benchmark hacking that doesn't stand up to real-world application.

Replies

wmf • today at 3:45 AM

You got it backwards; it's ~200 on single stream so the 2,600 is achieved with ~13 streams.
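
To spell out the arithmetic in the reply above, here is a minimal sketch. It assumes every stream sustains the same rate regardless of concurrency, which is an idealization and not something the thread confirms, and it treats both figures as being in the same throughput unit. The function name `streams_needed` is hypothetical.

```python
import math


def streams_needed(aggregate: float, per_stream: float) -> int:
    """Smallest number of concurrent streams whose combined rate reaches
    `aggregate`, assuming each stream holds `per_stream` (idealized)."""
    if per_stream <= 0:
        raise ValueError("per_stream must be positive")
    return math.ceil(aggregate / per_stream)


# Figures quoted in the thread: ~200 per stream, 2,600 for the whole node.
print(streams_needed(2600, 200))
```

In practice per-stream rate usually drops as concurrency rises, so the real stream count behind an aggregate figure may be higher than this estimate.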

technoabsurdist • today at 3:47 AM

Hi, yes: it's not optimized for single stream, it's optimized for total node throughput.
