I don't think that's what is happening. The models can genuinely "read" the text off the images, though usually with less-than-perfect accuracy, and the visual input costs the model fewer tokens than running OCR to convert the images into text and then sending that text in. I don't think there is any intermediate stage where a free OCR pass is being applied in this situation. (I realize that does happen in some situations.)
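
For a rough sense of the token arithmetic behind that claim, here is a back-of-envelope sketch. Everything in it is an assumption for illustration: `patch_px`, `merge`, and `chars_per_token` are hypothetical knobs, not figures from any particular model or tokenizer, and real systems may count image tokens quite differently.

```python
import math


def vision_tokens(width_px, height_px, patch_px=16, merge=16):
    """Rough image-token estimate (hypothetical scheme).

    Assumes the image is cut into patch_px x patch_px patches and that
    `merge` patches are pooled into one token before reaching the model.
    """
    patches = math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)
    return math.ceil(patches / merge)


def text_tokens(num_chars, chars_per_token=4.0):
    """Rough text-token estimate, assuming a fixed average characters per token."""
    return math.ceil(num_chars / chars_per_token)


# Illustrative inputs only: a 1024x1024 page image holding about 3000 characters.
image_cost = vision_tokens(1024, 1024)
ocr_text_cost = text_tokens(3000)
print(image_cost, ocr_text_cost)
```

Whether the image actually comes out cheaper depends entirely on those assumed parameters and on how dense the text on the page is, so treat this as a way to frame the comparison rather than as evidence for it.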