Documents are processed as tokens as well, unless the page bitmap is OCR'd first.
Images, though, are natively compatible with multimodal LLMs, so there's no image->text translation layer in between.
It's just that the unit of cost is different (e.g. "visual tokens" vs. text tokens).
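
To make the cost-unit point concrete, here's a rough sketch. Both rates (`PIXELS_PER_VISUAL_TOKEN`, `CHARS_PER_TEXT_TOKEN`) are placeholder assumptions, not any particular provider's pricing; real visual-token counts depend on the model and on how it resizes or tiles the image.

```python
import math

# Hypothetical rates for illustration only; real models differ.
PIXELS_PER_VISUAL_TOKEN = 750  # assumed: one visual token per ~750 pixels
CHARS_PER_TEXT_TOKEN = 4       # assumed: rough average for English text


def estimate_visual_tokens(width_px: int, height_px: int) -> int:
    """Estimate tokens for an image fed directly to a multimodal model."""
    return math.ceil(width_px * height_px / PIXELS_PER_VISUAL_TOKEN)


def estimate_text_tokens(text: str) -> int:
    """Estimate tokens for plain text, e.g. the output of OCR on a page."""
    return math.ceil(len(text) / CHARS_PER_TEXT_TOKEN)
```

Same page, two different meters: pass the bitmap to `estimate_visual_tokens`, or OCR it first and pass the text to `estimate_text_tokens`.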