Of course it isn't
A text encoding uses 8bits per character on average, tokenization further compresses that
An image font would be 25 bits if 5x5, and most fonts are 12 pixels high
Of course it isn't efficient, this is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit)
huh, what if the image encoding is 8 bits per R, G, B values of the pixel, then one can encode the same amount of text in less pixel dimensions (3 letters would need 1 pixel instead of three 12x12 pixels)
The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be random-looking colourful palette. It might not even need to use 8 bits per character, since ANSI is 7 bits/character.
[dead]
You are wrong.
Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]
Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the github, but is a promising research direction.
[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...