WarmWash • yesterday at 11:34 PM • 1 reply

LLMs can't really "see", so I challenge you to draw a pelican on a bike without any visual feedback, just code, because that is how they are doing it.

Vision tokens for transformers aren't really well solved yet, which is why they can smash a PhD-level math problem yet trip over a "count the cats on the chair" problem.

Replies

raffael_de • today at 8:28 AM

It's not about seeing. It's about identifying the legs of the pelican and then transferring the concept and mechanics of riding a bicycle, plus the geometry of a body and a bicycle. The entire task also has nothing to do with vision tokens.
