How should I update my simplistic understanding that decode is bandwidth-bound, given these results showing the B70 decoding faster than a 4090 (which has about 50% more bandwidth)?
I doubt you'd get the same sort of result on a modern-ish MoE or dense model via a more standard inference engine like llama.cpp or vLLM. I don't think MLPerf is a reasonable benchmark at this point.
Edit: Here is a simple llama.cpp comparison where the token-generation results match the rule of thumb.
https://www.reddit.com/r/LocalLLaMA/comments/1st6lp6/nvidia_...
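For context, the rule of thumb is simple arithmetic: each decoded token streams (roughly) all active model weights through memory once, so tokens/sec is capped at bandwidth divided by model size in bytes. A minimal sketch of that estimate, using hypothetical numbers that only preserve the ~1.5x bandwidth gap mentioned above:

    # Back-of-envelope: if decode is purely bandwidth-bound, the ceiling on
    # token generation is memory bandwidth divided by bytes read per token
    # (roughly the size of the active weights).

    def decode_tps_ceiling(bandwidth_gb_s: float, model_size_gb: float) -> float:
        """Upper bound on tokens/sec for a bandwidth-bound decode."""
        return bandwidth_gb_s / model_size_gb

    # Hypothetical figures: an 8 GB quantized model, and two GPUs whose only
    # assumed relationship is the ~1.5x bandwidth ratio from the thread.
    model_gb = 8.0
    for name, bw_gb_s in [("higher-bw GPU", 1000.0), ("lower-bw GPU", 667.0)]:
        print(f"{name}: <= {decode_tps_ceiling(bw_gb_s, model_gb):.0f} tok/s")

If that ceiling held exactly, the higher-bandwidth card would decode ~1.5x faster, which is why a result in the other direction is surprising and worth checking against a simpler engine.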