Have you tried anything with https://codeberg.org/ikawrakow/illama / https://github.com/ikawrakow/ik_llama.cpp and their 4-bit quants?
Or maybe even Microsoft's BitNet? https://github.com/microsoft/BitNet
https://github.com/ikawrakow/ik_llama.cpp/pull/337
https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
That would be an interesting comparison for running local LLMs on such low-end/edge devices, or on common office machines with only an iGPU.
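For reference, something like this is roughly what I'd benchmark with on such a box. It's only a minimal sketch: the GGUF filename and the llama-cli binary/flags are assumptions based on the usual llama.cpp conventions, not verified against either repo, so check the model page and build docs first.

```python
# Rough sketch (assumptions marked): download the BitNet GGUF from Hugging Face
# and run it through an ik_llama.cpp / BitNet build of the llama.cpp CLI.
import subprocess
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="microsoft/bitnet-b1.58-2B-4T-gguf",
    filename="ggml-model-i2_s.gguf",  # assumed filename, verify on the model page
)

# Standard llama.cpp-style flags: -m model, -p prompt, -n tokens to generate, -t CPU threads
subprocess.run([
    "./llama-cli",                    # assumed path to your ik_llama.cpp (or BitNet) build
    "-m", model_path,
    "-p", "Explain BitNet b1.58 in one sentence.",
    "-n", "128",
    "-t", "4",
], check=True)
```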