Interesting "ScreenSpot Pro" results:
72.7% Gemini 3 Pro
11.4% Gemini 2.5 Pro
49.9% Claude Opus 4.5
3.50% GPT-5.1
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
That is... astronomically different. Is GPT-5.1 downscaling and losing critical information or something? How could it be so different?
Impressive... most impressive.
It's going to reach the low 90s very soon if trends continue.
I was surprised at how poorly GPT-5 did in comparison to Opus 4.1 and Gemini 2.5 on a pretty simple OCR task a few months ago - I should run that again against the latest models and see how they do. https://simonwillison.net/2025/Aug/29/the-perils-of-vibe-cod...