Did you double the tokens per second by adding a second GPU or was the increase significantly less?

gonzalohm • today at 6:19 PM • 2 replies

Replies

No real change in inference speed. It basically just allows me to slot in more context or a bigger model.

A single RTX 3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.

Sometimes that matters; a lot of the time it doesn't.
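To put a rough number on why long context eats VRAM, here is a minimal sketch of the usual KV cache size estimate. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real shape of any model mentioned in this thread, and the formula assumes full attention in every layer with 16-bit cache values. Models that use sliding-window layers or a quantized cache will need less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Estimate KV cache size for a transformer with full attention in every layer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_tokens=300_000)
print(f"KV cache at 300k tokens: {size / 2**30:.1f} GiB")
```

Plug in the values from your model's config to see whether the cache plus the weights fits on one card.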

On the speed front, MoE models are great. The biggest perf difference in modern models comes from the move to MoE architectures.

I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MoE model (both at Q4 quant), but the MoE version runs at ~3 times the speed (150 tok/s vs 46 tok/s).
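A back-of-the-envelope way to see why MoE helps: single-stream decoding is often limited by how many bytes of weights have to be read from VRAM per token, and an MoE model only reads its active experts. The sketch below is a simplified estimate, not a benchmark. It assumes decoding is purely memory-bandwidth bound, that Q4 costs about 0.5 bytes per parameter (ignoring quantization overhead), and it uses the RTX 3090's spec-sheet bandwidth of 936 GB/s. The function name is made up for illustration.

```python
RTX_3090_BANDWIDTH_GB_S = 936   # spec-sheet memory bandwidth
Q4_BYTES_PER_PARAM = 0.5        # assumption: 4 bits per weight, no overhead


def decode_tok_per_s_upper_bound(active_params_billions,
                                 bytes_per_param=Q4_BYTES_PER_PARAM,
                                 bandwidth_gb_s=RTX_3090_BANDWIDTH_GB_S):
    """Upper bound on tok/s if every active weight is read once per token."""
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


# Dense 31B model: all 31B parameters are active for every token.
dense_bound = decode_tok_per_s_upper_bound(31)

# Speedup reported in the comment above (150 tok/s vs 46 tok/s).
observed_speedup = 150 / 46

# Under this simplified model, the fraction of the dense model's weight
# traffic that the MoE model would be reading per token.
implied_traffic_fraction = 1 / observed_speedup

print(f"dense upper bound: {dense_bound:.0f} tok/s")
print(f"observed speedup: {observed_speedup:.2f}x")
print(f"implied weight traffic vs dense: {implied_traffic_fraction:.0%}")
```

The dense bound lands in the same ballpark as the measured 46 tok/s, which is what you would expect if bandwidth is the bottleneck. Real numbers will come in lower because of KV cache reads, kernel overhead, and routing.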

mirekrusin • today at 6:25 PM

You’re adding an extra GPU for more VRAM, not speed.
