I advise a medical nonprofit, and we ran a series of tests on cases that doctors had entered into our system when looking for specialist recommendations.
We found that gpt-5-mini performed better than gpt-5, Sonnet 4, and MedGemma.
I think these studies are very hard to accurately score. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.