Hacker News

storystarling · today at 10:32 AM

Curious how you handle sharding and KV cache pressure for a 120B model. Are you doing tensor parallelism across consumer cards, or is it a unified memory setup?


Replies

embedding-shape · today at 10:49 AM

I don't; it fits on my card with the full context. I think the native MXFP4 weights take ~70GB of VRAM (out of the 96GB available on an RTX Pro 6000), so I still have room to spare to run GPT-OSS-20B alongside for smaller tasks, plus Wayland+GNOME :)
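A rough back-of-envelope check of that figure (a sketch only; the ~117B parameter count and MXFP4's ~4.25 effective bits per parameter, i.e. 4-bit values plus one shared 8-bit scale per 32-element block, are assumptions, not stated in the thread):

    # Back-of-envelope VRAM estimate for MXFP4-quantized weights.
    # Assumptions (not from the thread): ~117B total parameters,
    # MXFP4 = 4-bit values + one shared 8-bit scale per 32-element
    # block, i.e. ~4.25 effective bits per parameter.
    PARAMS = 117e9
    BITS_PER_PARAM = 4 + 8 / 32  # 4.25

    weights_gb = PARAMS * BITS_PER_PARAM / 8 / 1e9
    print(f"MXFP4 weights: ~{weights_gb:.0f} GB")           # ~62 GB
    print(f"Headroom on 96 GB: ~{96 - weights_gb:.0f} GB")  # ~34 GB

That puts the weights alone around 62 GB; the gap up to the ~70 GB observed is plausibly the KV cache at full context plus any layers kept in higher precision.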
