Progress has not become linear. We've just hit the limits of what we can measure and explain easily.
A year ago, coding agents could barely manage decent autocomplete.
Now they can write whole applications.
That's much harder to show than an Elo score driven by how much people like emojis and bold text in their chat responses.
Don't forget that Llama 4 led LMArena and still turned out to be very weak.