At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.
I suggest you start using a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
I'm guessing it has the opposite problem from typical benchmarks, since there is no ground-truth pelican-on-a-bike SVG to overfit on. Instead, the model just has a corpus of shitty pelicans on bikes made by other LLMs, and it is mimicking those.
So we might have an outer alignment failure.
I think we’re now at the point where saying "the pelican example is in the training dataset" is itself part of the training dataset for all automated comment LLMs.