In terms of "does useless refactors I didn't ask for nor improved anything", my own ranked list goes something like: Gemini > Claude > GPT. I don't really experience this at all with various GPT models used via the API, but overall GPTs seems to stick to the system prompt way better than the rest. Clause does OK too, but Gemini is out of control and writes soo much code and does so much you didn't ask for, really acts like a overly eager junior developer.
The first time I used Claude, it rewrote >1k LOC without asking for it, but in retrospect, I was "using it wrong". With GPT, even when I told it to not do it, it still did that, but that was some time ago and it was not done via the API, so I dunno. I think I do agree with your list, but I haven't used Gemini that much.
Yeah, they do come across as "overly eager junior devs", good comparison. :D