mrbnprck • yesterday at 9:19 PM

Documents are processed as tokens as well, unless their bitmap is OCR'd.

Images, though, are natively compatible with multimodal LLMs, so there's no image->text translation layer in between. It's just that the unit of cost is different (e.g. "visual token" vs. "text token").

Replies

electrotype • yesterday at 9:29 PM

I see. I was thinking that it might be different if the document wasn't provided by you directly, but was instead fetched online by the LLM itself.
