Pretty great pelican: https://simonwillison.net/2026/Feb/19/gemini-31-pro/ - it took over 5 minutes though, which I think is because they're having performance teething problems on launch day.
What's crazy is that you've influenced them to spend real effort ensuring their model is good at generating animated SVGs of animals operating vehicles.
The most absurd benchmaxxing.
https://x.com/jeffdean/status/2024525132266688757?s=46&t=ZjF...
Does anyone understand why LLMs have gotten so good at this? Their ability to generate accurate SVG shapes seems to greatly outshine what I would expect, given their mediocre spatial understanding in other contexts.
Models are soon going to start benchmaxxing on generating SVGs of pelicans on bikes.
Another great benchmark would be to convert a raster image of a logo into SVG. I've yet to find a good tool for this that produces accurate smooth lines.
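For what it's worth, the classic open-source route is tracing a thresholded bitmap with potrace. A minimal sketch in Python (assuming potrace is installed and on your PATH, and Pillow is available; the filenames are made up):

    # Trace a raster logo into an SVG with potrace (a bitmap tracer).
    # Assumes potrace is on PATH; "logo.png" is a hypothetical input.
    import subprocess
    from PIL import Image

    # potrace only reads 1-bit bitmaps, so threshold the PNG to PBM first.
    Image.open("logo.png").convert("1").save("logo.pbm")

    # -s selects the SVG backend; -o names the output file.
    subprocess.run(["potrace", "-s", "logo.pbm", "-o", "logo.svg"], check=True)

This gives nicely smooth Bezier curves on high-contrast logos, but it's monochrome only, so colored fills are lost - which may be part of why no single tool feels complete.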
Cost per task has increased 4.2x, but their ARC-AGI-2 score went from 33.6% to 77.1%.
Cost per task is still significantly lower than Opus, even Opus 4.5.
It seems they trained the model to output good SVGs.
In their blog post [1], the first use case they mention is SVG generation. Thus, it might not be an indicator at all anymore.
[1] https://blog.google/innovation-and-ai/models-and-research/ge...
Did you stop using the more detailed prompt? I think you described it here: https://simonwillison.net/2025/Nov/18/gemini-3/
Less pretty but more practical: it's really good at outputting circuit designs as SVG schematics.
At this point the pelican benchmark is so widely used that I presume there must be high-quality pelicans in the training data. What about generating an okapi on a bicycle instead?
Ugh, the gears and chain don't mesh, and there's no sprocket on the rear hub.
But seriously, I can't believe LLMs are able to one-shot a pelican on a bicycle this well. Six years ago I wouldn't have guessed this would emerge as a capability of LLMs. I see why it does now, but... it still amazes me that they're so good at some things.
What do you think this particular prompt is evaluating for?
The more popular these particular evals are, the more likely the model will be trained for them.
You think they are able to see their output and iterate on it? Or is it pure token generation?
Is there something in your prompt about hats? Why is the pelican always wearing a hat recently?!
I used the AI Studio link and tried running it with the temperature set to 1.75: https://jsbin.com/locodaqovu/edit?html,output
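If you want to reproduce that outside AI Studio, here's a minimal sketch using the google-genai Python client (the model ID and prompt are placeholders I've made up, not necessarily the exact ones used):

    # Sketch: request the pelican SVG with a raised sampling temperature.
    # Assumes the google-genai package is installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads the API key from the environment
    response = client.models.generate_content(
        model="gemini-3.1-pro",  # placeholder - substitute the real model ID
        contents="Generate an SVG of a pelican riding a bicycle",
        config=types.GenerateContentConfig(temperature=1.75),
    )
    print(response.text)

Gemini's temperature range tops out at 2.0, so 1.75 is deliberately near the noisy end of the scale.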
I hope we keep beating this dead horse some more, I'm still not tired of it.
It's an excellent demonstration of the main issue I have with the Gemini family of models: they always go "above and beyond" and do a lot of extra stuff, even if I explicitly prompt against it. In this case, most of the SVG ends up consisting not just of a bike and a pelican, but of clouds, a sun, a hat on the pelican, and much more.
Exactly the same thing happens when you code: it's almost impossible to get Gemini not to do "helpful" drive-by refactors, and it keeps adding code comments no matter what I say. A very frustrating experience overall.