I've just tried replicating this on my Pi 5 16GB, running the latest llama.cpp... and it segfaults:
./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
...
Loading model... ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
Segmentation fault
I'm not sure how they're running it... is there any kind of guide for replicating their results? It takes up a little over 10 GB of RAM (watching with btop) before it segfaults and quits. [Edit: had to add -c 4096 to cut down the context size, now it loads.]
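For reference, that means the invocation that loads for me is just the one above plus the smaller context, i.e. something like:
./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4 -c 4096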
Would you actually be able to get useful results from it? I'm looking into self-hosting LLMs for Python/JS development, but I don't know whether the output would be good enough to be useful.
Have you tried anything with ik_llama.cpp (https://codeberg.org/ikawrakow/illama, https://github.com/ikawrakow/ik_llama.cpp) and their 4-bit quants? Or maybe even Microsoft's BitNet (https://github.com/microsoft/BitNet, https://github.com/ikawrakow/ik_llama.cpp/pull/337, https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf)?
That would be an interesting comparison for running local LLMs on such low-end/edge devices, or on common office machines with only an iGPU.
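A rough sketch of what that comparison could look like, assuming ik_llama.cpp keeps upstream llama.cpp's CMake build and llama-cli front end (the model filename is a placeholder and the run flags just mirror the Pi 5 test above; nothing here is verified against the fork):
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release -j
# run one of the 4-bit quants with the same thread/context settings as the Pi 5 test
./build/bin/llama-cli -m models/your-4bit-quant.gguf -t 4 -c 4096 -p "Write a hello world in Python"
BitNet has its own setup scripts, so I won't guess at those here.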
Tested the same model on an Intel N100 mini PC with 16 GB - the hundred-bucks PC:
llama-server -m /Qwen3-30B-A3B-Instruct-2507-GGUF:IQ3_S --jinja -c 4096 --host 0.0.0.0 --port 8033
Got <= 10 t/s, which I think is not so bad!
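For anyone who wants to poke at that setup: llama-server exposes an OpenAI-compatible HTTP API, so a quick sanity check against the host/port above could look like this (the prompt is just an example):
curl http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Write a Python function that reverses a string."}]}'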
On an AMD Ryzen 5 5500U with Radeon Graphics, compiled for Vulkan: got 15 t/s (could swear this morning it was <= 20 t/s).
On an AMD Ryzen 7 H 255 with Radeon 780M Graphics, compiled for Vulkan: got 40 t/s. On the last one I did a quick comparison with the unsloth version, unsloth/Qwen3-30B-A3B-GGUF:Q4_K_M, and got 25 t/s. Can't really comment on the quality of the output - seems similar.
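If anyone wants to make that comparison a bit more rigorous, here is a rough sketch with llama-bench (the Vulkan flag is for upstream llama.cpp, and the model filenames are placeholders for the two quants above):
# build with the Vulkan backend enabled
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# benchmark both quants back to back; filenames are placeholders
./build/bin/llama-bench -m Qwen3-30B-A3B-Instruct-2507-IQ3_S.gguf -m Qwen3-30B-A3B-Q4_K_M.gguf -p 512 -n 128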