I feel like this time it is indeed in the training set, because it is too good to be true.
Can you run your other tests and see the difference?
If I were them, I'd run such requests through a diffusion model and then try to distill an SVG out of that.
I think at this point we can safely put the pelican test in the category of Goodhart's law.
If they cook these in, I wonder what else was cooked in there to make it look good.
It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":
https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...