The problem is using a language model to assess images.
Probably 80% of "LLMs are below expectations" complaints (from the general population) involve some form of image analysis.
Image tokenization is hard because, unlike language tokens, which are each dense with meaning, image tokens tend to be meaningless or irrelevant, yet they get processed all the same.
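To put rough numbers on that, here is a minimal sketch assuming a ViT-style encoder that emits one token per fixed-size patch. The 224x224 input, the 16-pixel patch size, and the helper name `patch_token_count` are illustrative assumptions, not any specific model's settings, and word count is only a crude stand-in for text tokens.

```python
import math

def patch_token_count(width: int, height: int, patch: int = 16) -> int:
    """Tokens produced if every patch x patch tile becomes one token (assumed scheme)."""
    return math.ceil(width / patch) * math.ceil(height / patch)

# Assumed input size; real models often resize, tile, or pool differently.
image_tokens = patch_token_count(224, 224)

# The same puzzle stated verbally; words are a rough proxy for text tokens.
text = "Move one toothpick to make a square."
word_count = len(text.split())

print(f"image: {image_tokens} patch tokens, text: about {word_count} words")
```

Under these assumptions the image costs the same number of tokens whether it shows toothpicks or a blank wall, and most of those patches are background.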
Give a SOTA LLM a picture of toothpicks and ask it to move one to make a square, and it will probably fumble it. But give a mid-size LLM from 2 years ago the same problem in verbal form, and it will nail it almost every time.
The takeaway is: do everything you can to avoid making the LLM rely on images for the answer.
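As one way to do that for the toothpick example, the layout can be written out as text before it ever reaches the model. This is only a sketch: the grid-coordinate encoding, the sample layout, and the helper names `describe_toothpicks` and `build_prompt` are all made up for illustration.

```python
def describe_toothpicks(segments):
    """One line per toothpick, each given as a segment between two (x, y) points."""
    return "\n".join(
        f"Toothpick {i}: from {a} to {b}"
        for i, (a, b) in enumerate(segments, start=1)
    )

def build_prompt(segments, question):
    """Combine a short header, the toothpick list, and the question into one prompt."""
    header = "Toothpicks lie on a grid; each is a unit segment between two (x, y) points."
    return "\n".join([header, describe_toothpicks(segments), question])

# Hypothetical layout: three sides of a unit square plus one stray toothpick.
segments = [
    ((0, 0), (1, 0)),
    ((1, 0), (1, 1)),
    ((1, 1), (0, 1)),
    ((0, 0), (-1, 0)),
]

prompt = build_prompt(segments, "Move exactly one toothpick so the four form a square.")
print(prompt)
```

The resulting string can be sent to any text-only model, so the answer no longer depends on how well the image was encoded.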
I thought all the recent models were "multimodal"? Is the image part just an image recognizer stuck in front of the text model?