No benchmark will be perfect, especially if it's public, but it's a fun experiment for watching these models get better and better.