The variance is way too high for this test to have any value at all. I ran it 10 times, and every pelican on a bicycle was a better rendition than that. You could say about half of them were perfect.
Compared to the other benchmarks, which are much more gameable, I trust PelicanBikeEval way more.
Well, the variance is itself interesting.