The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.
Here's the animated version: https://claude.ai/public/artifacts/3db12520-eaea-4769-82be-7...
If we do get paperclipped, I hope it is of the "cycling pelican" variety. Thanks for your important contribution to alignment, Simon!
They trained for it. That's the +0.1!
Do you find that word choices like "generate" (as opposed to "create", "author", "write" etc.) influence the model's success?
Also, is it bad that I almost immediately noticed that both of the pelican's legs are on the same side of the bicycle, but I had to look up an image on Wikipedia to confirm that they shouldn't have long necks?
Also, have you tried iterating prompts on this test to see if you can get more realistic results? (How much does it help to make them look up reference images first?)
Isn't there a point at which it trains itself on these various outputs, or someone somewhere draws one and feeds it into the model so as to pass this benchmark?
This benchmark inspired me to have Codex/Claude build a D&D battlemap tool with SVGs.
They got surprisingly far, but I did need to iterate a few times to have it build tools that would check for things like: don't put walls on roads or water.
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark, you could try taking one agent and telling it to use a coding agent (via tmux) to build you a pelican.
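For illustration, a placement check of the kind mentioned above (no walls on roads or water) might look something like the sketch below. The tile-grid format, the `BLOCKED_FOR_WALLS` set, and the `find_wall_conflicts` name are all hypothetical; the thread doesn't say how the actual tool represents its maps.

```python
# Hypothetical rule set: terrain types a wall must never be placed on.
BLOCKED_FOR_WALLS = {"road", "water"}


def find_wall_conflicts(terrain, walls):
    """Return the wall positions that sit on a road or water tile.

    terrain: list of rows, each a list of terrain names (assumed format).
    walls: iterable of (row, col) positions where a wall was placed.
    """
    conflicts = []
    for row, col in walls:
        if terrain[row][col] in BLOCKED_FOR_WALLS:
            conflicts.append((row, col))
    return conflicts


# Tiny made-up map: a road runs down the middle, water in one corner.
terrain = [
    ["grass", "road", "grass"],
    ["grass", "road", "water"],
]
walls = [(0, 0), (0, 1), (1, 2)]

print(find_wall_conflicts(terrain, walls))  # [(0, 1), (1, 2)]
```

A check like this gives the agent a concrete list of violations to fix on the next iteration, rather than relying on it to spot the problem in the rendered map.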
Well, the clouds are upside-down, so I don't think I can give it a pass.
This really is my favorite benchmark.
I suppose the pelican must now be specifically trained for, since it's a well-known benchmark.
I'm firing all of my developers this afternoon.
Best pelican so far, would you say? Or where does it rank in the pelican benchmark?
What about the Pelo2 benchmark? (the gray bird that is not gray)
Do you have a GIF? I need an evolving pelican GIF.
Pretty sure at this point they train it on pelicans.
The ears on top are a cute touch.
Would love to find out they're overfitting for pelican drawings.