I generated pelicans riding bicycles on both thinking level low and thinking level high:

simonw • today at 5:06 PM • 18 replies

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

Replies

> the bicycle frame is the correct shape

No, the handlebar is wrong. The handlebar is rotating the frame instead of rotating the front wheel. The handlebar should be mounted on the same line as the front wheel.

Hopefully 4.9 will read my comments :)

eminence32 • today at 8:30 PM

I bet someone shares this link every time you post about bicycles, but since I didn't see anyone share it yet in this thread, I'll take the opportunity to do so:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

Turns out even humans can be pretty bad at drawing bicycles :)

impalallama • today at 9:38 PM

I actually like the 4.7 one the most, interestingly enough. Not that you can "objectively" weigh artistic output like this.

simonw • today at 7:46 PM

Here are pelicans at all of the thinking levels: low, medium, high, xhigh, max

https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

jonas21 • today at 5:20 PM

Glad to see that the "high thinking" level adds a helmet. Always a smart choice.

spmartin823 • today at 5:32 PM

You've peed in the pool, Simon, this has to be a part of the internal evals by now! You've got to try something new - maybe a panda in a canoe?

ceroxylon • today at 5:33 PM

I really like that thinking level high gave the pelican a helmet.

Xunjin • today at 5:23 PM

Hey simonw, I love your test. Do you think using thinking level "max" makes sense for it? I would love to see the results.

toastmaster11 • today at 6:12 PM

I find the most miraculous thing about 4.7 to be that the pelican is facing left. I wonder why right-facing everything is so ubiquitous in these images.

yanis_t • today at 5:15 PM

Simon, does your pelican test really capture differences among models, or should you at least try it 10 times or so to average out the random effects?
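
A minimal sketch of the repeated-runs idea raised in this comment, not anything the thread's author actually does. It assumes each generated image has somehow been given a numeric quality score (the thread describes no such scorer), and the scores below are made up purely for illustration. It averages the runs per thinking level and does a rough check of whether the gap between two levels is larger than run-to-run noise.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, standard error of the mean) for a list of per-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def clearly_different(a, b, z=2.0):
    """Rough check: do the means differ by more than z combined standard errors?"""
    mean_a, se_a = summarize(a)
    mean_b, se_b = summarize(b)
    return abs(mean_a - mean_b) > z * math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical scores, 10 runs per thinking level (illustrative numbers only).
low = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5]
high = [7, 8, 6, 7, 8, 7, 7, 6, 8, 7]

print(summarize(low))
print(summarize(high))
print(clearly_different(low, high))
```

With a single run per level there is no way to estimate the spread at all, which is the concern the comment raises; the threshold `z=2.0` here is an arbitrary rule of thumb, not a formal significance test.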

silisili • today at 6:53 PM

The vast majority (if not all) of these make it impossible to turn, among other fun things. Just out of curiosity, have you tried prompting further with an explanation of how a bike must operate, to see if it does the right thing?

fragmede • today at 8:47 PM

For comparison, what's GPT-5.5 producing today?

timsuchanek • today at 6:00 PM

Thanks for always providing this so promptly. I'm wondering what the next, harder challenge could be. Maybe an animated SVG?

1attice • today at 5:18 PM

That little red hat on hard mode is sending me. 4.8 has whimsy

nickvec • today at 5:11 PM

Is the "opossum riding an e-scooter" benchmark in the works for Opus 4.8? ;)

whalesalad • today at 6:43 PM

Eventually the frontier model folks are going to pick up on your pelican-on-a-bike test and bake in flawless results for that particular request.

highwaylights • today at 6:07 PM

Am I allowed to say that pelican's little helmet is adorable? I can't provide a strong computational proof, or even a shred of anecdata...

...but that pelican's little helmet is adorable.

onlyrealcuzzo • today at 5:09 PM

4.7 reigns supreme IMO.
