Hacker News

gonzalohm today at 6:19 PM

Did you double the tokens per second by adding a second GPU or was the increase significantly less?


Replies

horsawlarway today at 6:37 PM

No real change in inference speed. It basically just allows me to slot in more context or a bigger model.

A single RTX-3090 will do approximately the same tok/s, but it won't fit the entire 300k context in VRAM.

Sometimes that matters, a lot of times it doesn't.
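For a rough sense of why a long context can push you past one card, here is a back-of-envelope sketch. The layer count, KV head count, head dimension, and 4-bit weight size are hypothetical placeholders, not the real config of any model mentioned here, and the formula assumes plain full attention with an fp16 KV cache and ignores activations and runtime overhead.

```python
# Back-of-envelope VRAM estimate: quantized weights + KV cache.
# All model dimensions below are HYPOTHETICAL, chosen only for illustration.

RTX_3090_VRAM_BYTES = 24 * 10**9  # nominal 24 GB per card


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of a full-attention KV cache; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def weight_bytes(n_params, bits_per_param=4):
    """Approximate weight size, assuming a flat bits-per-parameter quantization."""
    return n_params * bits_per_param // 8


def fits(total_bytes, n_gpus):
    """True if the total fits in the combined VRAM of n_gpus cards."""
    return total_bytes <= n_gpus * RTX_3090_VRAM_BYTES


# Hypothetical 26B-parameter model with a 300k-token context.
kv = kv_cache_bytes(n_layers=30, n_kv_heads=4, head_dim=128, context_len=300_000)
weights = weight_bytes(26_000_000_000)
total = kv + weights

print(f"KV cache: {kv / 1e9:.1f} GB, weights: {weights / 1e9:.1f} GB")
print("fits on 1 GPU:", fits(total, 1))
print("fits on 2 GPUs:", fits(total, 2))
```

With these made-up numbers the total lands above one card's VRAM and below two, which matches the general shape of the situation described above. Real models (sliding-window attention, quantized KV cache) can differ a lot.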

On the speed front - MOE models are great. Biggest perf difference in modern models is the move to MOE architectures.

I get very similar quality from both the Gemma-4 31B dense model and the Gemma-4 26B MOE model (both at Q4 quant), but the MOE version runs at ~3 times the speed (150 tok/s vs 46 tok/s).
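A rough way to see where that gap comes from: single-stream decoding is usually limited by memory bandwidth, so an upper bound on tok/s is bandwidth divided by the bytes of weights read per token. The sketch below assumes ~0.5 bytes per parameter for Q4 and the RTX 3090's 936 GB/s spec bandwidth. The MoE active-parameter count (4B) is a hypothetical placeholder, since the number isn't given here.

```python
# Bandwidth-bound ceiling on decode speed: each generated token has to read
# every ACTIVE weight once, so tok/s <= bandwidth / bytes_read_per_token.

RTX_3090_BANDWIDTH = 936e9   # bytes/s, spec-sheet memory bandwidth
Q4_BYTES_PER_PARAM = 0.5     # assumption: ~4 bits per weight, overhead ignored


def decode_tok_s_upper_bound(active_params, bytes_per_param, bandwidth):
    """Theoretical ceiling on tokens/s when decode is memory-bandwidth bound."""
    return bandwidth / (active_params * bytes_per_param)


# Dense model: all 31B parameters are read for every token.
dense_bound = decode_tok_s_upper_bound(31e9, Q4_BYTES_PER_PARAM, RTX_3090_BANDWIDTH)

# MoE model: only the routed experts are read. 4e9 is a HYPOTHETICAL
# active-parameter count used purely for illustration.
moe_bound = decode_tok_s_upper_bound(4e9, Q4_BYTES_PER_PARAM, RTX_3090_BANDWIDTH)

# Speedup from the figures quoted above (150 tok/s vs 46 tok/s).
measured_speedup = 150 / 46

print(f"dense ceiling: {dense_bound:.1f} tok/s")
print(f"MoE ceiling:   {moe_bound:.1f} tok/s")
print(f"measured speedup: {measured_speedup:.2f}x")
```

These are ceilings, not predictions. Measured speeds sit below them because of KV cache reads, routing, and kernel overhead.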

mirekrusin today at 6:25 PM

You’re adding an extra GPU for more VRAM, not speed.