Tangentially related: I don't think OCR is the right term and I am generally vocal about that. But seeing this unquestioned here, I am wondering if I am the one who is wrong here. Is it ok to call this OCR? To me ocr means text in the end, not visual tokens.
It's not. OCR is not what the vision model is doing here. We're used to using OCR as a verb but it's more accurate to say the model "visioned" it.
Also, some models still do OCR and it's usually way more expensive that way.
So if I OCR a document, edit it, and print it, OCR didn't happen?
OCR means optical character recognition. The terms do not require a direct transcription, but that is mostly what OCR meant in the past. If you’re using an LLM’s vision capability to pass in text and the LLM actually understands it, then I would say that it recognized the characters, hence OCR seems okay to use.