Qwen3.5 9b seems to be fairly competent at OCR and text formatting cleanup running in llama.cpp on CPU, albeit slow. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama), on an old 1650 Ti with 4GB VRAM (it tries to allocate too much memory).
If you’re building from source, the vulkan backend is the easiest to build and use for GPU offloading.
I have a 1660ti and the cachyos + aur/llama.cpp-cuda package is working fine for me. With about 5.3 GB of usable memory, I find that the 35B model is by far the most capable one that performs just as fast as the 4B model that fits entirely on my GPU. I did try the 9B model and was surprisingly capable. However 35B still better in some of my own anecdotal test cases. Very happy with the improvement. However, I notice that qwen 3.5 is about half the speed of qwen 3