There are some spectacular local models for generating text descriptions of images now. I suggest starting with Mistral Small 3.2, Gemma 3 and Qwen2.5-VL - all available via Ollama.
I expect we will see a Qwen3-VL soon.
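
If you want to try this from a script, here is a minimal sketch that sends an image to a locally running Ollama server and asks for a description. It assumes Ollama is listening on its default port (11434) and that you have already pulled a vision-capable model. The helper names `build_payload` and `describe_image` are made up for illustration, and the model tag `gemma3` is an assumption: check `ollama list` for the exact tags on your machine.

```python
import base64
import json
import urllib.request

# Default local endpoint for Ollama's generate API (assumes a stock install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, image_bytes: bytes,
                  prompt: str = "Describe this image in detail.") -> dict:
    """Build the JSON body: images are passed as base64-encoded strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }


def describe_image(model: str, image_path: str, url: str = OLLAMA_URL) -> str:
    """POST the image to the server and return the generated description."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, f.read())
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `describe_image("gemma3", "photo.jpg")` should return the description as a string, where `photo.jpg` is a placeholder path. Swapping the first argument is all it takes to compare models on the same image.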