Hacker News

mwcampbell, yesterday at 5:28 PM (4 replies)

I invested about $4,000 in an NVIDIA DGX Spark several months ago. It has 128 GB of unified RAM and the NVIDIA GB10 chip. With that RAM, several CPU cores, and the 4 TB NVMe SSD, it's a very capable ARM64 Linux computer even without the GPU, and so far I've mostly been using it as such. But I wonder: what's the most capable model, specifically for coding, that can run well on that hardware?


Replies

lee_ars, yesterday at 11:18 PM

I'm currently working through research and testing for an article on Ars about the Spark and what things one might do with it, and I've kind of stumbled into a two-LLM agentic setup. Qwen3.6-35B-A3B (via nvidia/Qwen3.6-35B-A3B-NVFP4) is the planning agent, and the FP8 version of Qwen3-Coder-30B-A3B-Instruct (Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8) is the coding agent that the planner delegates tasks down to. I'm sticking with vLLM as the inference engine, and I've got it wired together into a 2-agent loop with Opencode.

The Qwen3.6-35B-A3B planner hums along at 50-55 tokens/s, and the Qwen3-Coder-30B-A3B-Instruct coder does 30-35 tokens/s. With both agents up and ready to work, RAM consumption sits at about 112 of 128 GB.

It's pretty okay. I'm faffing around with having it disassemble old MS-DOS games from the 1980s, which is a task that lends itself well to the setup. It's not the fastest thing in the world, but with the planner's context window at 256k tokens and the coding agent at 128k, they chew through pretty long task lists handing things back and forth without complaint. The only real issue is that even with really tightly scoped prompts, the coding agent tends to hallucinate like it's on LSD. But the planning agent appears to be quite good at spotting the hallucinations and re-parceling work back to the coder.
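For anyone curious what that hand-back-and-forth looks like in the abstract, here is a minimal sketch of a planner/coder review loop. It is not the actual Opencode wiring: `run_task`, the `Agent` type, the ACCEPT/REJECT reply convention, and `max_rounds` are all hypothetical names made up for illustration. Each agent is modeled as a plain function from prompt text to reply text, so in a real setup each one would presumably wrap a call to whichever endpoint serves that model.

```python
from typing import Callable, Tuple

# Hypothetical abstraction: an agent is any function mapping a prompt to a reply.
Agent = Callable[[str], str]


def run_task(task: str, planner: Agent, coder: Agent, max_rounds: int = 3) -> Tuple[str, int]:
    """Delegate one task to the coder and let the planner accept or reject it.

    Returns (accepted_draft, rounds_used). Raises RuntimeError if the planner
    rejects every draft within max_rounds.
    """
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        # On a retry, pass the planner's objection back to the coder.
        prompt = task if not feedback else f"{task}\n\nReviewer feedback: {feedback}"
        draft = coder(prompt)

        # Assumed convention: the planner answers "ACCEPT" or "REJECT: <reason>".
        verdict = planner(f"Task: {task}\nDraft: {draft}\nReply ACCEPT or REJECT: <reason>")
        if verdict.strip().upper().startswith("ACCEPT"):
            return draft, round_no

        # Keep only the reason after the colon, or the whole verdict if there is none.
        feedback = verdict.partition(":")[2].strip() or verdict.strip()

    raise RuntimeError(f"planner rejected every draft after {max_rounds} rounds")
```

The round cap matters with a coder that hallucinates often: without it, one badly scoped task could bounce between the two models indefinitely.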

It's neat. I'm going to be sad when I have to return the review unit in a couple of months.

Edit: I have also been fiddling with Deepseek v4 Flash via Antirez's setup (https://github.com/antirez/ds4), and it's pretty fantastic (and fantastically easy to get running). It's pretty pokey on the Spark, though, at 14-ish tokens/sec. And unless you have a second Spark, it's going to be the only model you run at one time, as it eats alllll the rams.

anon373839, yesterday at 11:18 PM

Qwen 3.5 122B can fit with context at a pretty high quant (Q6). That's an excellent model.
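A rough back-of-the-envelope check on that claim, as a sketch rather than a measurement: weight storage is approximately parameter count times bits per weight, divided by 8. The 6.56 bits/weight figure is an assumption for a Q6_K-style quant; real model files mix quantization types across tensors, and the Spark's unified memory is shared with the OS and everything else running on it.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


# Assumption: about 6.56 bits per weight on average for a Q6-class quant.
q6_weights = weights_gb(122, 6.56)

# Whatever is left of 128 GB has to cover KV cache (context), runtime, and the OS.
headroom = 128 - q6_weights

print(f"weights: {q6_weights:.1f} GB, headroom: {headroom:.1f} GB")
```

Under those assumptions the weights come to roughly 100 GB, which leaves somewhat under 30 GB for context and everything else.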

Yoric, yesterday at 5:57 PM

https://www.canirun.ai/?status=tight might answer that question

morganastra, yesterday at 6:03 PM

Deepseek v4 flash is shockingly strong for its size and reportedly runs well on that hardware.
