I'd be very, very hesitant to trust studies like this. It's very easy to mess up these benchmarks.
See, for example, this recent paper where AI managed to beat radiologists at interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).
And when interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, human doctors don't diagnose an ER patient from the notes alone. You're asking them to perform a task that isn't necessary and that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.
Which isn't to say that I think the study is definitely wrong or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.
When you read through the article, it shows that the gap between doctors and LLMs actually disappeared (in terms of statistical significance) once both were allowed to read the full case notes.
The headline is quoting a number based on diagnoses guessed from nurses' notes alone. My guess is the LLM was simply more willing to hazard a guess from those selected case studies than the doctors were.
Interestingly, this recent study using ChatGPT Health gave quite a different outcome (https://www.nature.com/articles/s41591-026-04297-7): there it got emergency triage wrong 50% of the time.
I think it's plausible since doctors tend to have human cognitive biases and miss things. People tend to fixate on patterns they're most familiar with.
I haven't finished reading the linked paper, but I'm intrigued by the assumption that the results must be an illusion or a mirage when the model wasn't given access to the x-rays.
That seems like a very reasonable takeaway, but it skips the other one: do the x-rays actually make the results less accurate?
I think AI can be useful for any kind of context interpretation, but it shouldn't be the one making the decision.
It could run in the background on patient data and message the doctor: "I see X in the diagnostic, have you ruled out Y? It fits for reasons a, b, c."
I like my coding agents the same way: inform me during review about things I've missed, instead of having me comb through everything it generates on a first pass.
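For concreteness, a rough sketch of that "second reader" pattern. Every name here is made up: llm_review would wrap an actual model call over the record, notify_doctor an actual pager/EHR integration.

    # Hypothetical "second reader" loop: the model only surfaces
    # questions to the doctor; it never writes a diagnosis anywhere.
    from dataclasses import dataclass

    @dataclass
    class Finding:
        observation: str    # what the model noticed
        differential: str   # what it suggests ruling out
        reasons: list       # why it thinks it fits

    def llm_review(record: str) -> list:
        # Stand-in for a real model call over the patient record.
        return [Finding("X in the diagnostic", "Y", ["a", "b", "c"])]

    def notify_doctor(message: str) -> None:
        print(message)  # stand-in for a real notification channel

    for f in llm_review("...patient notes..."):
        notify_doctor(f"I see {f.observation}. Have you ruled out "
                      f"{f.differential}? It fits for reasons "
                      f"{', '.join(f.reasons)}.")
    # Note there's no "diagnosis" field at all: the human decides.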
These types of experiments are bound to have biases depending on who is doing them and who is funding them. Experiments are often funded precisely to move the narrative in a desired direction. That's probably a good argument for government-funded research in sensitive areas like this.
Weird that this is the case, and that it's a new study.
But those kinds of x-ray models are already actively used. They aren't used as the only and final diagnosis, though. It's more like peer review and prioritization: "check this image first because it seems the most critical today."
Hallucination on steroids, wow. I had to read through the abstract to believe it:
"In the most extreme case, our model achieved the top rank on a standard chest Xray question-answering benchmark without access to any images."
I think the bigger takeaway here is that 50% of the time doctors will miss what you have.
I'm even more concerned that current models are not trained to say no, or to even recognize most failure modes.
"Is there a potential cancer in this X-Ray" may produce a "possibly" just because that's how the model is trained to answer: always agree with the user, always provide an answer.
Oh, and don't forget that "Is there a potential cancer in this X-Ray" and "Are there any potential problems in this X-Ray" are two completely different prompts that will lead to wildly different answers.
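You can try the framing experiment yourself. A quick sketch using the OpenAI Python client (the model name is a placeholder, and I'm substituting report text for the actual image to keep the sketch simple):

    # Ask the same case two ways; the system prompt at least gives the
    # model explicit permission to abstain instead of guessing.
    from openai import OpenAI

    client = OpenAI()

    def ask(question: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder; use whatever you have
            messages=[
                {"role": "system",
                 "content": "If the evidence is insufficient, answer "
                            "'cannot determine' rather than guessing."},
                {"role": "user", "content": question},
            ],
        )
        return resp.choices[0].message.content

    print(ask("Is there a potential cancer in this report? <report text>"))
    print(ask("Are there any potential problems in this report? <report text>"))
    # The leading framing tends to pull a "possibly"; the open-ended one
    # surfaces (or omits) findings the first prompt never asked about.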
I agree with you on this specific study. However, I can't really wrap my head around the idea that doctors will stay better than AI models in the long run. After all, medicine is all about knowledge, experience, and intelligence (maybe "pattern recognition"), and on all of those we should expect the best AI models (especially ones focused solely on the medical field) to beat the large majority of humans (aka doctors). If we already make this assumption for software engineers, we should make it for this field as well. And let's be realistic: every time I've seen a doc in the last few months (including the ER twice), they were using ChatGPT (not kidding, it shocked me).
So I’m genuinely curious:
What is the specific capability (or combination of capabilities) that people believe will remain, permanently or at least for decades, out of reach, such that a top medical AI cannot match or exceed the performance of a good human doctor? Let's put liability and ethics aside and be purely objective about it.