For Intel CPUs, Phi-2 (2.7B) and TinyLlama (1.1B) run reasonably well using llama.cpp with 4-bit qua...

ethan_smith • last Saturday at 4:49 PM • 1 reply • view on HN

For Intel CPUs, Phi-2 (2.7B) and TinyLlama (1.1B) run reasonably well using llama.cpp with 4-bit quantization. GGUF models with INT4 quantization typically need ~2GB RAM per billion parameters, so even older machines can handle smaller models.

Replies

akawry • last Sunday at 1:23 PM

Take a look at ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp

CPU performance is much better than mainline llama, as well as having more quantization types available

alt Hacker News

Replies