It's not. OCR isn't what the vision model is doing here. We're used to using "OCR" as a verb, but it's more accurate to say the model "visioned" it.
Also, some models still do run actual OCR, and it's usually way more expensive that way.