LLMs can't really "see", so I challenge you to draw a pelican on a bike without any visual feedback, just code. Because that is how they are doing it.
Vision tokens for transformers aren't really well solved yet, which is why they can smash a PhD-level math problem and then trip over a "count the cats on the chair" problem.
It's not about seeing. It's about identifying the legs of the pelican and then transferring the concept and mechanics of riding a bicycle, plus the geometry of a body and a bicycle. The entire task also has nothing to do with vision tokens.
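
To make the "just geometry, no looking" point concrete, here is a rough sketch of what drawing blind amounts to: pick coordinates from what you know about bikes and birds, then emit SVG. The function name `pelican_on_bike_svg` and every number in it are made up for illustration; this is not how any particular model does it.

```python
import xml.etree.ElementTree as ET

def pelican_on_bike_svg(wheel_r=40, wheelbase=120):
    """Build an SVG string from coordinates alone (hypothetical layout)."""
    ground = 200
    rear = (60, ground - wheel_r)             # rear wheel hub
    front = (rear[0] + wheelbase, rear[1])    # front wheel hub
    crank = ((rear[0] + front[0]) // 2, rear[1])  # pedals sit between the hubs
    seat = (rear[0] + 30, rear[1] - 60)       # saddle above the rear half of the frame
    body = (seat[0] + 10, seat[1] - 25)       # pelican body rests on the saddle
    hip = (body[0], body[1] + 15)             # where the leg leaves the body
    head = (body[0] + 35, body[1] - 35)

    def line(a, b):
        return f'<line x1="{a[0]}" y1="{a[1]}" x2="{b[0]}" y2="{b[1]}" stroke="black"/>'

    parts = [
        # Bicycle: two wheels and a simple frame
        f'<circle cx="{rear[0]}" cy="{rear[1]}" r="{wheel_r}" fill="none" stroke="black"/>',
        f'<circle cx="{front[0]}" cy="{front[1]}" r="{wheel_r}" fill="none" stroke="black"/>',
        line(rear, crank), line(seat, crank), line(rear, seat), line(seat, front),
        # Pelican: body, head, beak, and a leg that has to reach the pedals
        f'<ellipse cx="{body[0]}" cy="{body[1]}" rx="25" ry="15" fill="white" stroke="black"/>',
        f'<circle cx="{head[0]}" cy="{head[1]}" r="10" fill="white" stroke="black"/>',
        f'<polygon points="{head[0] + 10},{head[1] - 3} {head[0] + 45},{head[1] + 5} '
        f'{head[0] + 10},{head[1] + 6}" fill="orange" stroke="black"/>',
        line(hip, crank),
    ]
    return '<svg viewBox="0 0 240 200">' + "".join(parts) + "</svg>"

svg = pelican_on_bike_svg()
ET.fromstring(svg)  # raises ParseError if the markup is not well-formed
```

Nothing here checks how the result looks. The only thing tying the leg to the pedals is that both use the same `crank` coordinate, which is the kind of reasoning the task requires.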