ljm • yesterday at 10:25 PM • 2 replies

It's not a benchmark though, right? Because there's no control group or reference.

It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.

Replies

tylervigen • yesterday at 10:32 PM

For 2026 SOTA models I think that is fair.

For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

interstice • yesterday at 10:36 PM

So if it can generate exactly what you had in mind, based presumably on the most subtle of cues, like your personal quirks gleaned from a few sentences, that could be _terrifying_, right?
