We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelic...

Jimmc414 • yesterday at 6:57 PM • 6 replies

We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.

Replies

simonw • yesterday at 7:19 PM

I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

Workaccount2 • today at 12:28 AM

It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.

So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.

So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?

0cf8612b2e1e • today at 12:26 AM

I assume all of the models also have variations on, “how many ‘r’s in strawberry”.

thatwasunusual • yesterday at 11:51 PM

> We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.

I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?

The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?

[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...

th0ma5 • yesterday at 7:32 PM

If this had any substance then it could be criticized, which is what they're trying to avoid.
