Very curious what hardware you're running this on!
The same 24GB VRAM RTX 4090 I bought to play Cyberpunk 2077 with.
Works perfectly fine in llama.cpp, throwing 70+ t/s at me with a 128k context and q8 K/V cache, using the IQ4_NL quant plus MTP with the MTP K/V cache at q4.
Also leaving this here because you might find it useful: https://hypfer.github.io/will-it-fit-llama-cpp/
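For a rough feel of why the K/V cache type matters when squeezing a long context into 24GB, here is a back-of-the-envelope sketch in Python. The model dimensions in it are made up for illustration (nothing in this thread states them), the formula assumes plain full-attention layers, and it ignores the weights and compute buffers entirely, so treat the linked calculator as the more accurate tool.

```python
# Rough K/V cache size estimate. A sketch under stated assumptions,
# not how llama.cpp itself accounts for memory.

# Bytes per cached element. llama.cpp's q8_0 and q4_0 formats store
# blocks of 32 values plus one fp16 scale: 34 and 18 bytes per block.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate K/V cache size in bytes for full-attention layers.

    Assumes every layer caches K and V for every token, which is not
    true for sliding-window or hybrid architectures.
    """
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K and V
    return elements_per_token * n_ctx * BYTES_PER_ELEMENT[cache_type]


def to_gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only for illustration.
    dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    n_ctx = 128 * 1024  # 128k tokens of context

    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(n_ctx=n_ctx, cache_type=cache_type, **dims)
        print(f"{cache_type:>5}: {to_gib(size):.2f} GiB")
```

Plug in the real layer count, K/V head count and head size from the model's metadata to get numbers that mean something for your setup.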