I make ChatGPT and Claude code review each other's outputs. ChatGPT thinks its solutions are better than what Claude produces. What was more surprising to me is that Claude, more often than not, prefers ChatGPT's responses too.
I am not too sure one can really extrapolate much from that, but I do find it interesting nonetheless.
I think language is also an important factor. I have a hard time deciding which of the two LLMs is worse at Swift, for example. They both seem equally great and awful in different ways.
I do the same (I have both review a piece of code), and Codex tends to produce more nitpicky feedback. Opus usually agrees with it on around half the feedback, but says that the other half is too nitpicky to implement. I generally agree with Opus' assessment, and do agree that Codex nitpicks a lot.
I can't even use Codex for planning because it goes down deep design rabbit holes, whereas Opus is great at staying at the proper, high level.