The pelican has looked very same-y across all frontier models: same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways that mirror existing AI pelicans on the internet.
The "big beak!" comment in the SVG source makes me think it's definitely a gamed "benchmark" at this point.
Was it ever a good test? How do you even objectively assess what a good pelican on a bike is anyway?
Do you think the models are ready for the next level? I believe that would be: a pelican feeding spaghetti to Will Smith.
I'd be very surprised if this is in the training data given that most models mess it up to this day. E.g. look at the ones from Opus.
Variations of this comment have been posted for over a year. The pelican has now morphed into part of HN culture rather than a legitimate benchmark, but it's still valuable as a meme.
I'd say it's working great for its intended purpose. Keeps Simon on top of all these threads and funnels traffic to his site.