I would try the Qwen models before LLaVA.
Do you need the embeddings to be private? Or just the photos?
For photo indexing I'd run CLIP directly and save on compute; there's no need to run a whole language model.
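Here's a rough sketch of what I mean by running CLIP directly: embed each photo once, keep the vectors locally, and rank them against a text query by cosine similarity. The helper names (`build_index`, `search`, `load_clip_embedders`) are made up for illustration, and the `openai/clip-vit-base-patch32` checkpoint is just an assumption; it also assumes `torch`, `transformers` and `Pillow` are installed. Treat it as a starting point, not a finished tool.

```python
import math
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def build_index(paths: Iterable[str],
                embed_image: Callable[[str], Vector]) -> Dict[str, List[float]]:
    """Embed every photo once and keep the unit-length vectors in memory."""
    return {p: normalize(embed_image(p)) for p in paths}


def search(index: Dict[str, List[float]], query_vec: Vector,
           top_k: int = 5) -> List[Tuple[str, float]]:
    """Return the top_k (path, cosine similarity) pairs, best match first."""
    q = normalize(query_vec)
    scored = [(sum(a * b for a, b in zip(q, v)), p) for p, v in index.items()]
    scored.sort(reverse=True)
    return [(p, s) for s, p in scored[:top_k]]


def load_clip_embedders(model_name: str = "openai/clip-vit-base-patch32"):
    """Hypothetical helper: returns (embed_image, embed_text) backed by CLIP.

    Heavy imports are kept inside the function so the index/search code
    above works without them. The checkpoint name is an assumption.
    """
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)

    def embed_image(path: str) -> List[float]:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            return model.get_image_features(**inputs)[0].tolist()

    def embed_text(text: str) -> List[float]:
        inputs = processor(text=[text], return_tensors="pt", padding=True)
        with torch.no_grad():
            return model.get_text_features(**inputs)[0].tolist()

    return embed_image, embed_text
```

Usage would be something like `embed_image, embed_text = load_clip_embedders()`, then `index = build_index(photo_paths, embed_image)` and `search(index, embed_text("dog on a beach"))`. Everything runs locally, so neither the photos nor the embeddings have to leave your machine. For more than a few thousand photos you'd probably want to swap the pure-Python scan for a vector library.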