
Jimmc414 · yesterday at 6:57 PM · 6 replies

We are getting to the point where it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like a good idea to also run a random or unannounced similar test to catch any benchmaxxing.
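A minimal sketch of what that "random or unannounced similar test" could look like: sample a fresh animal/vehicle pair at review time, so a lab can't train on the exact prompt in advance. The word lists and function name here are made up for illustration, not from any actual benchmark.

```python
import random

# Hypothetical word lists -- any sufficiently large, unpublished pools
# of animals and vehicles would do.
ANIMALS = ["pelican", "bumblebee", "otter", "flamingo", "hedgehog"]
VEHICLES = ["bicycle", "kayak", "unicycle", "skateboard", "hang glider"]

def surprise_prompt(rng: random.Random) -> str:
    """Build a benchmark prompt from a randomly drawn animal/vehicle pair."""
    animal = rng.choice(ANIMALS)
    vehicle = rng.choice(VEHICLES)
    return f"Generate an SVG of a {animal} riding a {vehicle}"

# Unseeded RNG, so each review draws a different combination.
print(surprise_prompt(random.Random()))
```

With 5 animals and 5 vehicles there are only 25 combinations, so real pools would need to be much larger (and kept private) to make targeted training impractical.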


Replies

simonw · yesterday at 7:19 PM

I wrote about that possibility here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

Workaccount2 · today at 12:28 AM

It would be easy to out models that trained on the bike pelican, because they would probably suck at the kayaking bumblebee.

So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.

So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?

0cf8612b2e1e · today at 12:26 AM

I assume all of the models also have variations on, “how many ‘r’s in strawberry”.

thatwasunusual · yesterday at 11:51 PM

> We are getting to the point that its not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.

I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?

The "wine glass problem"[0] - and probably others - seems a lot more relevant to me...?

[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...

th0ma5 · yesterday at 7:32 PM

If this had any substance then it could be criticized, which is what they're trying to avoid.
