Hacker News

embedding-shape · yesterday at 7:11 PM

An RTX Pro 6000, as an example, ends up using ~66GB when running the MXFP4 native quant with llama-server/llama.cpp at max context. You could presumably also do it with two 5090s at slightly reduced context, or with different software aimed at memory efficiency.
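For reference, a minimal llama-server invocation along the lines described above might look like this; the model filename is a placeholder, and the exact flags may vary with your llama.cpp build:

```shell
# Sketch of the setup described above (path and flags are illustrative).
# -c 0    : use the model's maximum (trained) context length
# -ngl 99 : offload all layers to the GPU
llama-server -m ./model-mxfp4.gguf -c 0 -ngl 99 --host 127.0.0.1 --port 8080
```

With full GPU offload and max context, the weights plus KV cache are what push a large MXFP4 model toward the ~66GB figure mentioned.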


Replies

kristianp · yesterday at 10:06 PM

That card has 96GB of GDDR7 ECC memory, to save people looking it up.