Love this benchmark, it's always the first place I look. It also seems like it's time to move the goalposts; I'm not sure we're getting enough resolution between models anymore.
Out of curiosity, why does Gemini get gold for the poker example but gpt-image 1.5 does not? I couldn't see a difference between the two.