Hacker News

vladgur · today at 3:17 PM

This is getting very close to fitting on a single 3090 with 24 GB of VRAM :)


Replies

originalvichy · today at 3:24 PM

Yup! Smaller quants will fit within 24GB but they might sacrifice context length.

I’m excited to try out the MLX version to see if 32GB of memory from a Pro M-series Mac can get some acceptable tok/s with longer context. HuggingFace has uploaded some MLX versions already.

skiing_crawling · today at 6:42 PM

I used to successfully run qwen3.5 27b Q4_K_M on a single 3090 with these llama-server flags: `-ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0`
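The `--cache-type-k/v q4_0` flags matter because at a 262144-token context the KV cache, not the weights, dominates memory. A rough sketch of the arithmetic (the layer count, KV-head count, and head dimension below are illustrative assumptions, not the actual architecture):

```python
# Back-of-envelope KV-cache size estimate. The model dimensions
# (48 layers, 8 KV heads via GQA, head_dim 128) are assumptions
# chosen only to show the order of magnitude.

def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128,
                 bytes_per_elem=2.0):
    """Total bytes for the K and V caches across all layers, in GiB."""
    elems = 2 * n_layers * n_kv_heads * head_dim * ctx_len  # 2 = K + V
    return elems * bytes_per_elem / 2**30

CTX = 262144
f16 = kv_cache_gib(CTX, bytes_per_elem=2.0)      # fp16 cache
q4  = kv_cache_gib(CTX, bytes_per_elem=0.5625)   # q4_0: 18 bytes per 32 elems
print(f"f16 cache: {f16:.1f} GiB, q4_0 cache: {q4:.1f} GiB")
```

Under these assumptions an fp16 cache at full context would need roughly 48 GiB, while the q4_0 cache shrinks that to about 13.5 GiB, which is why the quantized cache types are what make a 24 GB card viable here.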

GaggiX · today at 3:25 PM

At 4-bit quantization it should already fit quite nicely.
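The fit claim follows from simple arithmetic; a sketch, assuming a ~30B-parameter model and that a Q4_K_M-style quant lands around 4.5-5 bits per weight including scales (both numbers are assumptions for illustration):

```python
# Rough weight-memory estimate for a quantized model.
# Parameter count and bits-per-weight are illustrative assumptions.

def quant_weight_gb(n_params_b, bits_per_weight):
    """Approximate VRAM needed for the weights alone, in GB."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(quant_weight_gb(30, 4.5))   # ~17 GB: fits a 24 GB card with headroom
print(quant_weight_gb(30, 16.0))  # 60 GB at fp16: does not fit
```

The leftover ~7 GB is what the KV cache and activations have to squeeze into, which is where shorter context or a quantized cache comes in.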
