> we demonstrated running gpt-oss-120b on two RNGD chips [snip] at 5.8 ms per output token
That's 86 tokens/second/chip.
By comparison, an H100 will do 2390 tokens/second/GPU.
Am I comparing the wrong things somehow?
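For concreteness, here's the arithmetic behind those numbers as a minimal sketch; the 5.8 ms and 2390 figures are taken from the comments above, not independently verified:

```python
# Per-chip throughput implied by the quoted RNGD latency,
# assuming a single request streaming across two chips.
latency_s = 5.8e-3                    # 5.8 ms per output token (quoted)
num_chips = 2

tokens_per_sec_total = 1 / latency_s  # ~172 tokens/s for the 2-chip setup
tokens_per_sec_per_chip = tokens_per_sec_total / num_chips  # ~86

h100_tokens_per_sec = 2390            # quoted H100 figure

print(f"{tokens_per_sec_per_chip:.0f} tok/s/chip vs {h100_tokens_per_sec} tok/s/GPU")
```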
I thought they were claiming it was more efficient, as in tokens per watt. I didn't see a direct comparison on that metric, but maybe I didn't look hard enough.
I think you are comparing latency with throughput. You can't just invert per-token latency to get throughput, because the concurrency is unknown: an H100 serving many requests at once can stream each one slowly while still producing thousands of tokens per second in aggregate. But then, the RNGD result is probably with concurrency = 1.
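To illustrate the distinction, a rough sketch with made-up numbers (not measurements): inter-token latency tells you how fast one request streams, while aggregate throughput also scales with how many requests are decoded concurrently.

```python
# Rough model: with batched decoding, aggregate throughput is roughly
# concurrency / inter-token latency, as long as the batch fits in memory
# and per-token latency doesn't degrade too much. Numbers are illustrative.
def throughput_tok_per_s(inter_token_latency_s: float, concurrency: int) -> float:
    return concurrency / inter_token_latency_s

# Concurrency = 1: inverting latency is valid.
print(throughput_tok_per_s(5.8e-3, 1))   # ~172 tok/s

# Many concurrent requests: slower per-request streaming,
# far higher aggregate throughput.
print(throughput_tok_per_s(10e-3, 24))   # 2400 tok/s despite higher latency
```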