I mean, this is exactly what it is: just a wrapper that replaces the tokenizer. That's how LLMs can read images. The image gets turned into embeddings that the model consumes in place of token embeddings.
I'm just focusing on different parts.
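
To make the "wrapper that replaces the tokenizer" idea concrete, here's a rough toy sketch, not any particular model's code. Everything in it is an assumption for illustration: the names `embed_text` and `embed_image`, the sizes, and the random weights. Real systems typically use a trained vision encoder plus a learned projection, but the shape of the idea is the same: text goes through an embedding-table lookup, image patches go through a projection, and both end up as vectors of the same width in one sequence.

```python
import random

EMBED_DIM = 8  # hypothetical embedding width of the language model
PATCH = 2      # hypothetical patch size (PATCH x PATCH pixels)


def embed_text(token_ids, table):
    """Normal path: tokenizer ids are looked up in the embedding table."""
    return [table[t] for t in token_ids]


def embed_image(image, proj):
    """Image path: cut the image into patches and linearly project each
    flattened patch to EMBED_DIM, so it has the same shape as a token embedding."""
    height, width = len(image), len(image[0])
    vectors = []
    for r in range(0, height, PATCH):
        for c in range(0, width, PATCH):
            flat = [image[r + i][c + j] for i in range(PATCH) for j in range(PATCH)]
            # One dot product per projection row gives one embedding coordinate.
            vectors.append([sum(x * w for x, w in zip(flat, row)) for row in proj])
    return vectors


rng = random.Random(0)
# Random stand-ins for learned weights (assumption: 100-entry vocabulary).
table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(100)]
proj = [[rng.uniform(-1, 1) for _ in range(PATCH * PATCH)] for _ in range(EMBED_DIM)]

image = [[rng.random() for _ in range(4)] for _ in range(4)]  # toy 4x4 grayscale image
text_ids = [5, 17, 42]                                        # made-up token ids

# The model just sees one sequence of EMBED_DIM-wide vectors.
sequence = embed_image(image, proj) + embed_text(text_ids, table)
print(len(sequence), len(sequence[0]))
```

The point is that nothing downstream of `sequence` needs to know which vectors came from pixels and which came from token ids.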