Hacker News

ethan_smith, last Saturday at 4:49 PM

For Intel CPUs, Phi-2 (2.7B) and TinyLlama (1.1B) run reasonably well using llama.cpp with 4-bit quantization. 4-bit GGUF models typically need roughly 0.5 to 0.7 GB of RAM per billion parameters for the weights, plus some overhead for the context, so even older machines can handle smaller models.
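
A rough back-of-the-envelope sketch of the memory arithmetic, in case it helps. It assumes about 4.5 bits per weight for a 4-bit GGUF quantization (an assumed average, since the real figure varies by quantization type) and counts only the weights, ignoring KV cache and runtime overhead. The function name is just illustrative:

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    bits_per_weight=4.5 is an assumed average for 4-bit GGUF quantization;
    KV cache and runtime overhead are not included.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9  # bits -> bytes -> GB


# Parameter counts taken from the comment above.
for name, size_b in [("TinyLlama", 1.1), ("Phi-2", 2.7)]:
    print(f"{name} ({size_b}B): ~{estimate_weight_memory_gb(size_b):.2f} GB for weights")
```

Passing bits_per_weight=16 gives 2 GB per billion parameters, which is the figure for unquantized 16-bit weights.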


Replies

akawry, last Sunday at 1:23 PM

Take a look at ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp

Its CPU performance is much better than mainline llama.cpp, and it has more quantization types available.