Hacker News

vladgur · today at 3:17 PM

This is getting very close to fitting on a single 3090 with 24 GB of VRAM :)


Replies

originalvichy · today at 3:24 PM

Yup! Smaller quants will fit within 24GB but they might sacrifice context length.

I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already.

skiing_crawling · today at 6:42 PM

I used to successfully run qwen3.5 27b Q4_K_M on a single 3090 with these llama-server flags: `-ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0`
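The `--cache-type-k/v q4_0` flags matter because at a 262144-token context the KV cache, not the weights, dominates memory. A rough sketch of the arithmetic (the layer count, KV-head count, and head dimension below are illustrative assumptions, not the actual architecture):

```python
# Back-of-envelope KV-cache size estimate. The model dimensions
# (48 layers, 8 KV heads via GQA, head_dim 128) are assumptions
# chosen only to show the order of magnitude.

def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2.0):
    """Total bytes for the K and V caches across all layers, in GiB."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = K + V
    return elems * bytes_per_elem / 2**30

CTX = 262144
f16 = kv_cache_gib(CTX, bytes_per_elem=2.0)      # fp16 cache
q4  = kv_cache_gib(CTX, bytes_per_elem=0.5625)   # q4_0: 18 bytes per 32 elems
print(f"f16 cache: {f16:.1f} GiB, q4_0 cache: {q4:.1f} GiB")
```

Under these assumptions an fp16 cache at full context would need roughly 48 GiB, while the q4_0 cache shrinks that to about 13.5 GiB, which is why the quantized cache types are what make a 24 GB card viable here.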

GaggiX · today at 3:25 PM

At 4-bit quantization it should already fit quite nicely.
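The fit claim follows from simple arithmetic; a sketch, assuming a ~30B-parameter model and that a Q4_K_M-style quant lands around 4.5-5 bits per weight including scales (both numbers are assumptions for illustration):

```python
# Rough weight-memory estimate for a quantized model.
# Parameter count and bits-per-weight are illustrative assumptions.

def quant_weight_gb(n_params_b, bits_per_weight):
    """Approximate VRAM needed for the weights alone, in GB."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(quant_weight_gb(30, 4.5))   # ~17 GB: fits a 24 GB card with headroom
print(quant_weight_gb(30, 16.0))  # 60 GB at fp16: does not fit
```

The leftover ~7 GB is what the KV cache and activations have to squeeze into, which is where shorter context or a quantized cache comes in.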
