One of the interesting things to me about this is that Codex 5.2 found the most complex of the exploits.
This reflects my experience too. Opus 4.5 is my everyday driver - I like using it. But Codex 5.2 with Extra High thinking is just a bit more powerful.
Also, despite what people say, I don't believe progress in LLM performance is slowing down at all. Instead, we are having more trouble generating tasks that are hard enough, and the frontier tasks the models are failing at or just managing are so complex that most people outside the specialized field aren't interested enough to sit through the explanation.
The “hard enough” tasks are all behind IP walls. If a task is “hard enough”, that generally means it’s a commercial problem, likely involving disparate workflows and requiring a real human who probably isn’t a) inclined and/or b) permitted to publish the task. The incentives are aligned to capture all the value from solving that task for as long as possible and only then publish.
The Anthropic models are great workers/tool users. OpenAI Codex High is a great reviewer/fixer. Gemini is the genius repainting your bathroom walls into a Monet from memory because you mentioned once a few weeks ago that you liked classical art and needed to repaint your bathroom. Gemini didn’t mention the task or that it was starting it. It did a pretty good job, you had to admit afterwards.