Q4_K_M is the sweet-spot quantization for most models, at roughly 4.5 bits per parameter. So take the parameter count and multiply by 4.5/8 to get the weight size in bytes (e.g. an 8B model is about 4.5 GB), then add a bit more for context and KV cache. Short answer: any of the distilled models will run easily, but you still can't touch the raw one.
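
Quick back-of-the-envelope calculator in Python, if you want to plug in your own numbers. The ~2 GB overhead and the example model sizes are just illustrative placeholders, not exact figures:

```python
def q4km_ram_gb(params_billions: float, overhead_gb: float = 2.0) -> float:
    """Rough RAM needed to load a Q4_K_M quant of a model.

    Q4_K_M averages ~4.5 bits per weight, so each parameter
    costs 4.5 / 8 = 0.5625 bytes. overhead_gb is a rough
    placeholder for context/KV cache, not a fixed number.
    """
    weights_gb = params_billions * 4.5 / 8
    return weights_gb + overhead_gb

# A few common sizes (in billions of parameters), for illustration:
for size in (8, 14, 32, 70, 671):
    print(f"{size}B -> ~{q4km_ram_gb(size):.1f} GB")
```

An 8B model comes out around 6.5 GB with overhead, so it fits on most consumer machines, while a 671B model needs close to 400 GB, which is why the raw one is out of reach.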