Hacker News

binyu, today at 3:00 PM

> Now at 40-50tok/s generation and ~2000 tok/s

Not clear how you went from ~11-14 to ~40-50 tok/s. Is it by running the quant native model and adding a second Spark?

Cheers



wolttam, today at 3:10 PM

I suspect DwarfStar could squeeze more performance out of the single Spark, maybe closer to 20 tok/s.

Moving to two Sparks meant switching to vLLM with 2-way tensor parallelism and working multi-token prediction (MTP). The parallelism and MTP, on top of better-tuned kernels[1], gave an extremely nice boost! I was quite pleased. I've seen bursts up to 60 tok/s at ~150k context; sometimes the MTP seems to really kick in (i.e. a high acceptance rate on its tokens).
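To give a rough feel for why the acceptance rate matters so much, here is a back-of-the-envelope sketch. It is not how vLLM computes anything. It assumes each drafted token is accepted independently with the same probability, and that a verify step costs about the same as a normal decode step. The function names and the `step_overhead` parameter are made up for illustration.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per decode step with speculative/MTP drafting.

    Assumes every drafted token is accepted independently with probability
    accept_prob, and that drafting stops at the first rejection. The step
    always yields at least one token from the main model.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be >= 0")
    # 1 + p + p^2 + ... + p^k: the i-th draft token only counts if all
    # earlier draft tokens were accepted too.
    return sum(accept_prob ** i for i in range(num_draft + 1))


def estimated_tok_per_s(base_tok_per_s: float, accept_prob: float,
                        num_draft: int, step_overhead: float = 1.0) -> float:
    """Scale a baseline decode rate by the expected tokens per step.

    step_overhead is a hypothetical fudge factor (>= 1.0) for the extra
    cost of a draft-and-verify step relative to a plain decode step.
    """
    return base_tok_per_s * expected_tokens_per_step(accept_prob, num_draft) / step_overhead


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from this thread.
    for p in (0.0, 0.5, 0.9):
        print(p, estimated_tok_per_s(20.0, p, num_draft=1))
```

With a single draft token, the multiplier is just 1 + p, so it swings between 1x and 2x depending on how predictable the text is. That would be consistent with throughput arriving in bursts.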

Currently running a custom vLLM build put together by some folks on the Nvidia forums[2], which speaks to how early support for the model is.

[1]: https://github.com/lukealonso/b12x

[2]: https://forums.developer.nvidia.com/t/372268
