Hacker News

binyu today at 3:16 PM

Personally, I've tried to squeeze more tok/s out of DeepSeek V4 Flash on a single DGX Spark deployment, but only got marginal improvements. There's still work to do on fusing kernels and other optimizations, but that is already on antirez's roadmap, so it's not worth duplicating the effort.

I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and inference is fast. Do you run DeepSeek V4 Flash on vLLM?


Replies

wolttam today at 3:22 PM

Yep, those are the numbers I'm getting with DSv4 Flash on vLLM across 2 Sparks.