I suspect DwarfStar could squeeze more performance out of the single Spark, maybe closer to 20 tok/s.
Moving to 2 Sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction (MTP). The parallelism and MTP, on top of better-tuned kernels[1], gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60 tok/s at ~150k context; sometimes the MTP seems to really kick in (i.e. a high acceptance rate on its predicted tokens).
I'm currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.
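
For anyone wondering why the MTP acceptance rate matters so much for those bursts, here's a rough back-of-the-envelope sketch. It is not how vLLM computes anything. It assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection, which is a simplification. The function name and the numbers in the loop are made up for illustration.

```python
def expected_tokens_per_step(accept_prob: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Simplified model (an assumption, not measured behaviour):
    - each drafted token is accepted independently with probability accept_prob
    - the first rejection discards every later draft token
    - the main model always contributes one token of its own per pass
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    # 1 + p + p^2 + ... + p^k: the i-th draft token only counts
    # if all i tokens up to and including it were accepted.
    return sum(accept_prob ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates, purely to show the shape of the curve.
    for p in (0.3, 0.6, 0.9):
        print(p, expected_tokens_per_step(p, num_draft_tokens=2))
```

Under that model the tokens per pass climb quickly as the acceptance rate approaches 1, which would fit with throughput spiking on text the draft head finds easy to predict.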
[1]: https://github.com/lukealonso/b12x
[2]: https://forums.developer.nvidia.com/t/372268