Hacker News

WarmWash, yesterday at 11:34 PM

LLMs can't really "see", so I challenge you to draw a pelican on a bike without any visual feedback, just code, because that is how they are doing it.
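To make the "just code" constraint concrete, here is a rough Python sketch of drawing blind: every coordinate is a guess typed in without ever rendering the result. The helper names (`circle`, `line`, `pelican_on_bike`) and all the numbers are made up for illustration; this is not how any particular model produces its output.

```python
# Hypothetical sketch: all coordinates are guesses made without rendering.

def circle(cx, cy, r, fill="none"):
    """Return an SVG circle element as a string."""
    return f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}" stroke="black"/>'


def line(x1, y1, x2, y2):
    """Return an SVG line element as a string."""
    return f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>'


def pelican_on_bike():
    """Assemble an SVG document from hand-picked coordinates."""
    parts = [
        circle(60, 150, 30),       # rear wheel
        circle(160, 150, 30),      # front wheel
        line(60, 150, 110, 150),   # chain stay to pedal hub
        line(110, 150, 90, 100),   # seat tube
        line(90, 100, 150, 100),   # top tube
        line(150, 100, 160, 150),  # fork
        line(110, 150, 150, 100),  # down tube
        '<ellipse cx="95" cy="75" rx="28" ry="20" fill="white" stroke="black"/>',  # body
        circle(125, 45, 10, fill="white"),  # head
        '<polygon points="133,43 175,52 133,52" fill="orange" stroke="black"/>',  # beak
        line(95, 95, 110, 150),    # leg reaching the pedal hub
    ]
    body = "\n  ".join(parts)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="220" height="200">\n'
        f"  {body}\n</svg>"
    )


svg = pelican_on_bike()
```

Write `svg` to a file and open it in a browser to find out whether the leg actually lands on the pedal. That check is exactly the feedback step the author of the coordinates never had.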

Vision tokens for transformers aren't really a solved problem yet, which is why these models can smash a PhD-level math problem and then trip over a "count the cats on the chair" problem.


Replies

raffael_de, today at 8:28 AM

It's not about seeing. It's about identifying the legs of the pelican and then transferring the concept and mechanics of riding a bicycle, plus the geometry of a body and a bicycle. The entire task also has nothing to do with vision tokens.