bastawhiz • yesterday at 8:53 PM • 6 replies

There's no way the red v2 is doing anything with a 120b parameter model. I just finished building a dual a100 ai homelab (80gb vram combined with nvlink). Similar stats otherwise. 120b only fits with very heavy quantization, enough to make the model schizophrenic in my experience. And there's no room for kv, so you'll OOM around 4k of context.

I'm running a 70b model now that's okay, but it's still fairly tight. And I've got 16gb more vram than the red v2.

I'm also confused why this is 12U. My whole rig is 4u.

The green v2 has better GPUs. But for $65k, I'd expect a much better CPU and 256gb of RAM. It's not like a threadripper 7000 is going to break the bank.

I'm glad this exists but it's... honestly pretty perplexing
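
For a rough sense of the numbers above, here is a back-of-the-envelope sketch. It counts weight storage only and ignores quantization metadata, activations, and framework overhead, so real usage will be higher. The 64 GB figure is just 80 GB minus the 16 GB difference mentioned above, and the layer/head shape passed to `kv_cache_gb` is a hypothetical example, not the shape of any specific 120b model.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB for a single sequence.

    K and V each store n_kv_heads * head_dim values per layer per token.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# 80 GB is the dual A100 rig; 64 GB is assumed for the red v2 (80 - 16).
for vram_gb in (80, 64):
    for bits in (16, 8, 4, 3):
        w = weights_gb(120, bits)
        print(f"{vram_gb} GB VRAM, {bits:>2}-bit: "
              f"{w:6.1f} GB weights, {vram_gb - w:7.1f} GB left over")

# Hypothetical model shape, for illustration only.
print(f"KV cache at 4k context: {kv_cache_gb(80, 8, 128, 4096):.2f} GB")
```

Whatever is left over after the weights has to cover the KV cache and everything else, which is why the low-bit rows are the only ones that come close to fitting.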

Replies

oceanplexian • yesterday at 8:58 PM

It will work fine but it’s not necessarily insane performance. I can run a q4 of gpt-oss-120b on my Epyc Milan box that has similar specs and get something like 30-50 Tok/sec by splitting it across RAM and GPU.

The thing that’s less useful is the 64G VRAM/128G System RAM config. Even the large MoE models only need 20B for the router, so the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit).

overfeed • yesterday at 11:03 PM

> I'm also confused why this is 12U. My whole rig is 4u.

I imagine that's because they are buying a single SKU for the shell/case. I imagine their answer to your question would be: In order to keep prices low and quality high, we don't offer any customization to the server dimensions

Aurornis • today at 2:35 AM

> There's no way the red v2 is doing anything with a 120b parameter model.

I don't see the 120B claim on the page itself. Unless the page has been edited, I think it's something the submitter added.

I agree, though. The only way you're running 120B models on that device is either extreme quantization or by offloading layers to the CPU. Neither will be a good experience.

These aren't a good value buy unless you compare them to fully supported offerings from the big players.

It's going to be hard to target a market where most people know they can put together the exact same system for thousands of dollars less and have it assembled in an afternoon. RTX 6000 96GB cards are in stock at Newegg for $9000 right now which leaves almost $30,000 for the rest of the system. Even with today's RAM prices it's not hard to do better than that CPU and 256GB of RAM when you have a $30,000 budget.

ericd • today at 12:10 AM

Was that cheaper than a Blackwell 6000?

But yeah, 4x Blackwell 6000s are ~32-36k, not sure where the other $30k is going.

zozbot234 • yesterday at 9:13 PM

> And there's no room for kv, so you'll OOM around 4k of context.

Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using an LRU policy, so the overhead would be tolerable.
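
To make the idea concrete, here is a toy sketch of an LRU policy over KV blocks. It is not how any particular framework implements it: the class name `LruKvCache`, the block granularity, and the two plain containers standing in for VRAM and system RAM are all assumptions for illustration.

```python
from collections import OrderedDict


class LruKvCache:
    """Toy two-tier cache: hot KV blocks in a small 'VRAM' tier, the rest in 'RAM'."""

    def __init__(self, vram_capacity_blocks: int):
        self.capacity = vram_capacity_blocks
        self.vram = OrderedDict()  # block_id -> data, least recently used first
        self.ram = {}              # blocks spilled out of the VRAM tier
        self.transfers = 0         # RAM -> VRAM fetches, i.e. the offload overhead

    def put(self, block_id, data):
        """Store a new or updated block in the VRAM tier."""
        self.ram.pop(block_id, None)
        self.vram[block_id] = data
        self.vram.move_to_end(block_id)
        self._evict()

    def get(self, block_id):
        """Return a block, pulling it back into the VRAM tier if it was spilled."""
        if block_id in self.vram:
            self.vram.move_to_end(block_id)
            return self.vram[block_id]
        data = self.ram.pop(block_id)  # raises KeyError for an unknown block
        self.transfers += 1
        self.vram[block_id] = data
        self._evict()
        return data

    def _evict(self):
        """Spill least recently used blocks to RAM until the VRAM tier fits."""
        while len(self.vram) > self.capacity:
            old_id, old_data = self.vram.popitem(last=False)
            self.ram[old_id] = old_data


cache = LruKvCache(vram_capacity_blocks=2)
for block_id in range(3):
    cache.put(block_id, f"kv-block-{block_id}")
cache.get(0)  # block 0 was spilled, so this counts as one transfer
print(cache.transfers, list(cache.vram), list(cache.ram))
```

Whether the overhead is tolerable depends on how often spilled blocks have to be fetched back, which is what the `transfers` counter makes visible.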

ottah • today at 12:38 AM

Honestly, two rtx 8000s would probably have a better return on investment than the red v2. I have an eight gpu server: five rtx 8000, three rtx 6000 ada. For basic inference, the 8000s aren't bad at all. I'm sure the green with four rtx pro 6000s is dramatically faster, but there's a $25k markup I don't honestly understand.
