binyu • today at 3:16 PM • 1 reply

Personally, I've tried to squeeze more tok/s out of DeepSeek V4 Flash on a single DGX Spark, but only got marginal improvements. There's still work to do on kernel fusion and other optimizations, but those are already on antirez's roadmap, so it's not worth duplicating the effort.

I've had positive experiences running GLM 4.7 via vLLM: tool calling works well and inference is fast. Do you run DeepSeek V4 Flash on vLLM?

Replies

wolttam • today at 3:22 PM

Yep, those are the numbers I'm getting with DSv4 Flash on vLLM across 2 Sparks.
