Hacker News

OpenAI's o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors

275 points by donsupreme, yesterday at 12:30 AM, 230 comments

Comments

gpm yesterday at 7:29 PM

I'd be very, very hesitant to trust studies like this. It's very easy to mess up these benchmarks.

See, for example, this recent paper where AI managed to beat radiologists on interpreting x-rays... when the AI didn't even have access to the x-rays: https://arxiv.org/pdf/2603.21687 (on a pre-existing "large scale visual question answering benchmark for generalist chest x-ray understanding" that wasn't intentionally messed up).

And in interpreting x-rays, human radiologists actually do just look at the x-rays. In the context the article is discussing, the human doctors don't just look at the notes to diagnose the ER patient. You're asking them to perform a task that isn't necessary, that they aren't experienced or trained in, and then saying "the AI outperforms them". Even if the notes aren't accidentally giving away the answer through some weird side channel, that's not that surprising.

Which isn't to say that I think the study is either definitely wrong or intentionally deceptive. Just that I wouldn't draw strong conclusions from a single study here.

lukko yesterday at 8:24 PM

I'm surprised at both the article and the paper - both seem very hyperbolic. This is LLMs competing against doctors in a way that is heavily weighted in the LLMs' favour, which does not represent clinical practice. These reasoning cases are not benchmarks for doctors; they are learning tools.

I think it's important to note that diagnosis also relies on an accurate description of the patient in the first place, and the information you gather depends on the differential diagnosis. Part of the skill of being a doctor is gathering information from lots of different sources and trying to filter out what is important. This may come from the patient, who may not be able to communicate clearly or may be non-verbal, or from carers and next of kin. History-taking is a skill in itself, as is examination. Here, those data are given.

For pattern recognition from plain text, especially on questions that may be in o1's training data, I'm not surprised at all that it would outperform doctors, but it doesn't seem to be a clinically useful comparison. Deciding which investigations to order, what imaging to do, and filtering out unnecessary information from the history is a skill in itself, and can't really be separated from forming the diagnosis.

01100011 yesterday at 10:00 PM

I wouldn't put much weight on this study, but I think a lot of us can still attest to the usefulness of LLMs in self-diagnostics. The reality in the US is that it is difficult to get the attention and care of a doctor, so we're left having to do it ourselves. Ten years ago you'd hear docs complaining about patients coming in with things they found on Google, but now I don't think there's an alternative.

Case in point, I went to a podiatrist for foot and ankle issues. He diagnosed my foot issues from the x-ray but just shrugged his shoulders for the ankle issues and said the x-ray didn't show anything. My 15-minute allocation of his attention expired, and I left without a clue as to the issue or what corrective actions to take. Five minutes with an LLM and I had a plausible explanation for the ankle issues, one that aligned with the diagnosis for my foot.

creativeSlumber yesterday at 8:14 PM

> "An AI and a pair of human doctors were each given the same standard electronic health record to read"

This is handicapping the human doctors' abilities. There is a lot more information a human doctor can gather even from a brief observation of the patient.

jmpman yesterday at 2:19 PM

Besides using them for myself and my wife, I've also used LLMs to diagnose my dogs. I'm convinced there's a huge opportunity for AI-based veterinary care, especially one which then runs bidding across the local veterinary clinics to perform the care/surgeries. I've noticed that local vets vary in price by more than an order of magnitude. My 80-year-old mother and mother-in-law have been regularly scammed by overcharging vets, and with their dogs being a major part of their lives, they're extremely susceptible to pressure.

gizmodo59 yesterday at 10:42 PM

The negative reactions here baffle me. The fact that we can even get to, say, 30% with a computer is amazing. So much hatred towards AI and anything from frontier labs like OpenAI (or Goog, for that matter) makes no sense.

Kuyawa today at 12:39 AM

As a 60yo I developed my own AI medical assistant [1] and I've used it extensively for many conditions; I couldn't be happier. After analyzing some lab tests it even recommended a marker that the doctor hadn't considered at first. So yes, it won't replace doctors, but it is a very helpful tool for self-diagnosing simple conditions and for second opinions.

[1] https://mediconsulta.net (DeepSeek)

biglost today at 12:47 AM

I'm curious: I'd like to know whether that 33% is a subset of the 45-50%. If it's not a subset, then how serious was that error? More deaths? Longer recovery time? What did that difference translate into?

arkt8 today at 12:01 AM

How far is 67% from 55%? Did the research look at the same patients as the doctors did?

How can this be scientifically meaningful if there is no side-by-side comparison of how each case was evaluated by both, and of how each came to a different conclusion?

Who can ensure a doctor couldn't spot some blind spot the AI missed in the remaining 33%?

Tools are for combining efforts, not for replacement.

Throwing percentages like this at the public is quite irresponsible.
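
For what it's worth, even without per-patient data you can bound the overlap from the headline numbers alone. A minimal sketch (the 67% and 55% figures are from the article; the bounds are pure inclusion-exclusion):

  # Bounds on the overlap between the AI's correct diagnoses and the doctors',
  # derived only from the headline accuracies via inclusion-exclusion.
  ai_correct = 0.67    # AI accuracy reported by the study
  doc_correct = 0.55   # upper end of the doctors' 50-55% range

  # P(both correct) lies between P(A) + P(D) - 1 and min(P(A), P(D)).
  overlap_min = max(0.0, ai_correct + doc_correct - 1.0)  # 0.22
  overlap_max = min(ai_correct, doc_correct)              # 0.55

  # Patients only the doctors got right: anywhere from 0% up to 33%.
  doctors_only_max = doc_correct - overlap_min            # 0.33
  print(f"both correct: {overlap_min:.0%}-{overlap_max:.0%}, "
        f"doctor-only saves: up to {doctors_only_max:.0%}")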

mawadev yesterday at 8:35 PM

I don't think critical situations like this are a good use case for AI. Maybe in a decade we'll have AI help doctors by doing a pre-check. But what if the AI finds nothing and the doctor doesn't bother to look into it further? From my POV, it is this small question that breaks the technology from every angle later down the road. AI has to stay optional here.

Even if AI is used to sample or summarize more data than a human could get through in time: what if it misses something that a human wouldn't? What if a human, inversely, misses something that the AI wouldn't? Would you rather trust the machine or the human? (Especially when the human is the one held accountable.)

OptionOfT yesterday at 8:19 PM

As a 37-year-old male with 2 THRs (total hip replacements), I'm glad AI was NOT used in my diagnosis. All the models I showed my x-rays to said nothing was wrong, even when I added symptoms. When I added my age, they said the patient was too young.

(In those x-rays I was about 3 months away from being wheelchair-bound.)

The worst one was Gemini. Upload an x-ray of just the right hip, and it started talking about how good the left hip looked.

I think with AI taking over it's going to be harder to get a solution when your problem isn't run-of-the-mill.

tedggh yesterday at 8:42 PM

Believable and not shocking. LLMs may literally have saved my son, and potentially his mother too, by allowing us to fact-check a lot of nonsense data and scare tactics from a group of at least five different doctors ambushing us to make a life-changing decision in minutes. The problem is that doctors, at least in the US, prioritize liability exposure over patients' long-term outcomes. Say you need an intervention where two options, A and B, are available to you. A carries a 1% risk of complications but a great outcome. B carries a 0.1% risk of complications, but once you are discharged the short-term effects are challenging and the long-term effects are not well understood. Well, 10/10 times doctors will suggest option B and will do anything they can to nudge you into making that choice, like not telling you the absolute numbers and constantly using the word "death". They also lie about the outcomes because, again, once you accept the procedure, sign, and are sent home, they have nothing more to do with you.

beering yesterday at 4:55 AM

o1 is several generations old and was released in 2024. Is this quite old research that took a long time to get published?

chromacity yesterday at 8:41 PM

All the other points raised in this thread aside, it seems like an odd thing to benchmark, because a significant proportion of ER practice is dealing with emergencies, often accidental injuries. There's not a whole lot of diagnosing going on if you show up to the ER with a gash on your forehead or a missing finger.

wiseowise yesterday at 8:46 PM

The Pitt third season leak? All of the ER is fired and Robbie is fighting schizophrenia with 15 agents and Dana?

tsoukase yesterday at 10:59 PM

This reminds me of GPT-4-era studies where the LLM did better on a law school exam than a student. We are not in 2023 anymore, or, in the case of medicine, are we? If we are, this is bad news for health-related applications, as the low-hanging fruit in LLMs has already been picked.

SkiFreeWin3 yesterday at 7:32 PM

Yes, but what was the overlap?

jmcgough yesterday at 8:54 PM

LLMs can be a useful second opinion for a highly educated patient with good insight into their health and body, but this is not the average patient I see in an urban emergency department. Many patients can't give a cohesive history without a skilled clinician who can ask the right questions and read between the lines.

I am very skeptical of studies like this that don't adequately reflect real world conditions, but when I was a software engineer I probably wouldn't have understood what "real" medicine is like either.

getnormality yesterday at 11:19 PM

Wow, amazing. They had an AI robot running o1 look at live ER patients coming in, just like a real doctor, and it did that much better? Incredible! (literally)

afro88 yesterday at 9:33 PM

I wonder about the nuance within the data. Does the AI do much worse with children than with adults, for example, but still better overall? Or with biological males vs. females? We'd want it to do better across all groups and ages, so we're not introducing some kind of horrible bias that results in deaths or serious health consequences for particular groups.

1980phipsi yesterday at 11:19 PM

How much time do the doctors spend on a diagnosis versus o1?

llbbdd yesterday at 11:28 PM

Can't happen soon enough. If the bar were as high as it needed to be, there'd be like one qualified doctor on Earth so far.

PAndreew yesterday at 11:36 PM

I mean, an LLM is a slightly stirred-up soup of current human knowledge. It has an advantage in the quantity of accumulated data and maybe in connecting seemingly less-connected parts of that data, but not reliably. The human has an advantage (for now) in data collection (seeing, hearing, sensing the patient), actual agency, real-world experience, and getting the useful data out of the stirred-up soup. Both human and LLM are susceptible to bias and harmful influence. Let's simply isolate them in the diagnostic process and then compare their output: human collects data -> both human and LLM evaluate independently -> compare the results -> human may get new insights -> final diagnosis by human.
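
A rough Python sketch of that isolate-then-compare flow, with hypothetical function names standing in for the real clinical steps (nothing here is from the study; it's just the shape of the process):

  from dataclasses import dataclass
  from typing import Callable

  @dataclass(frozen=True)
  class Assessment:
      source: str     # "human" or "llm"
      diagnosis: str

  def diagnose(case_data: dict,
               human_eval: Callable[[dict], str],
               llm_eval: Callable[[dict], str],
               review: Callable[[Assessment, Assessment], str]) -> str:
      # The human has already collected case_data (history, exam, vitals).
      # Both evaluate the same data independently, blind to each other.
      human = Assessment("human", human_eval(case_data))
      llm = Assessment("llm", llm_eval(case_data))
      # The human compares both reads, may take new insights from the LLM's,
      # and always issues the final diagnosis.
      return review(human, llm)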

jmathai yesterday at 9:26 PM

I advise a medical non-profit, and we ran a series of tests against cases that doctors had entered into our system looking for specialist recommendations.

We found that gpt-5-mini performed better than gpt-5, Sonnet 4, and MedGemma.

I think these studies are very hard to score accurately. But in any case, AI seems to do a very good job compared to humans. Unsurprising, really.

gamerslexus today at 12:03 AM

Hold on. Does this mean ER diagnoses are marginally better than pure chance?

david_mchale yesterday at 10:51 PM

Having been in ERs too many times when they were beyond capacity, I think something like this would be better than patients slipping through the cracks. At least you'd get a chance.

SpyCoder77 yesterday at 7:42 PM

This is a rather new article about an old model...

theshrike79 yesterday at 7:58 PM

I'll repeat my idea on how this MUST be done:

1. AI gets data about the patient and makes a diagnosis. This is NOT shown to the doctor yet.

2. Doctor does their stuff, writes down their diagnosis. This diagnosis is locked down and versioned.

3. Doctor sees the AI's diagnosis.

4. Doctor can adjust their diagnosis, BUT the original stays in the system.

This way the AI stays an assistant and won't affect the doctor's initial decision, but they can change their mind after getting the extra data.
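
Something like an append-only record would enforce steps 2-4. A minimal Python sketch; the class and field names are illustrative, not any real EHR API:

  from dataclasses import dataclass, field
  from datetime import datetime, timezone

  @dataclass
  class DiagnosisLog:
      """Append-only: revisions are added, the original is never overwritten."""
      entries: list = field(default_factory=list)

      def record(self, author: str, diagnosis: str, saw_ai: bool) -> None:
          self.entries.append({
              "author": author,
              "diagnosis": diagnosis,
              "saw_ai_output": saw_ai,  # had the AI's read been revealed yet?
              "timestamp": datetime.now(timezone.utc).isoformat(),
          })

      @property
      def original(self) -> dict:
          return self.entries[0]   # the locked pre-AI diagnosis (step 2)

  log = DiagnosisLog()
  log.record("dr_a", "viral pharyngitis", saw_ai=False)      # step 2: locked
  # step 3: the doctor is now shown the AI's diagnosis...
  log.record("dr_a", "peritonsillar abscess", saw_ai=True)   # step 4: revision
  assert log.original["diagnosis"] == "viral pharyngitis"    # original survives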

arkt8 yesterday at 11:54 PM

How much confidence does 67% warrant? Was it looking at the same patients with the same info? If not, it's just clickbait.

swisniewski yesterday at 9:06 PM

Let's assume the AI does outperform the doctor.

I still want humans in the loop, interpreting the LLM's findings and providing a sanity check.

You can't hold an LLM accountable.

That's the minimum responsible bar even for LLM-authored code, which normally doesn't matter much. For something as important as ER diagnostics, having a human in the loop is crucial.

The narrative that these tools are replacing human intelligence rather than augmenting it is, quite frankly, stupid.

We should embrace these tools.

But "eliminating doctors"... hardly.

Aurornis yesterday at 9:43 PM

Gell-Mann Amnesia kicks in hard as soon as the LLM topic changes to a profession other than our own. It’s much easier to believe an LLM can outperform someone else doing their job than to believe that it’s a good idea to replace your own work with an LLM.

The number in the headline isn't even a good comparison, because they asked doctors to make a diagnosis from notes a nurse typed up. Doctors are trained to be conservative about diagnosing from someone else's notes, because it's their job to ask the patient questions and evaluate the situation, whereas an LLM will happily leap to a conclusion and deliver it with high confidence.

When they allowed both the doctors and the AI access to more information about the case, the difference between the groups collapsed into statistical insignificance:

> The diagnosis accuracy of the AI – OpenAI’s o1 reasoning model – rose to 82% when more detail was available, compared with the 70-79% accuracy achieved by the expert humans, though this difference was not statistically significant.
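
For intuition on why a gap like 82% vs. 79% can fail to reach significance: at plausible cohort sizes, the standard error swamps it. A rough two-proportion z-test sketch in Python; the n=100 per arm is my assumption for illustration, not the study's actual sample size:

  from math import sqrt
  from statistics import NormalDist

  def two_proportion_p(p1: float, p2: float, n1: int, n2: int) -> float:
      """Two-sided p-value for the difference between two proportions."""
      pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
      se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
      return 2 * (1 - NormalDist().cdf(abs(p1 - p2) / se))

  # 82% (AI) vs 79% (top of the doctors' range), hypothetical 100 cases each:
  print(two_proportion_p(0.82, 0.79, 100, 100))  # ~0.59, nowhere near 0.05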

Talking to my medical-professional friends, I hear LLMs are becoming a supercharged version of the Dr. Google and WebMD that fueled a lot of bad patient self-diagnoses in the past. Now patients are using LLMs to try to diagnose themselves, and doing it in a way where they learn how to lead the LLM to the diagnosis they want, which they can rehearse for a hundred rounds at home before presenting to the doctor and reciting the script and symptoms that worked best to convince the LLM they had a certain condition.

LeCompteSftware yesterday at 7:34 PM

It is easy to overinterpret this based on the headline; the doctors were actually at a slight disadvantage. This isn't how they normally work. It's a little more like a med school pop quiz:

  An AI and a pair of human doctors were each given the same standard electronic health record to read – typically including vital sign data, demographic information and a few sentences from a nurse about why the patient was there. The AI identified the exact or very close diagnosis in 67% of cases, beating the human doctors, who were right only 50%-55% of the time.... The study only tested humans against AIs looking at patient data that can be communicated via text. The AI’s reading of signals, such as the patient’s level of distress and their visual appearance, were not tested. That means the AI was performing more like a clinician producing a second opinion based on paperwork.

"I don't know, let's run more tests" is also a very important ability of doctors that was apparently not tested here, on top of all the usual methodological problems with overinterpreting results in AI/LLMs/ML/etc.

Sadly, I do think part of the problem here is cynical (even maniacal) careerist doctors who really shouldn't be working at hospitals. So even though I am generally quite anti-LLM, and really don't like the idea of patients interacting with them directly, I am a little optimistic about these becoming sanity/laziness checkers for health professionals.
DeepYogurt yesterday at 9:39 PM

Who's accountable for the 33%?

colechristensen yesterday at 8:06 PM

I think this is more a commentary on how bad ER diagnosis is.

adamtaylor_13 yesterday at 10:30 PM

Despite what I suspect the general consensus on HN may be, this does not surprise me at all.

My wife was recently diagnosed with Mast Cell Activation Syndrome (MCAS) after a pretty scary series of ER visits. It's a very strange and stubborn autoimmune disease that manifests with a number of symptoms that, taken individually, could indicate damn near anything.

You could almost feel the doctors rolling their eyes as she explained her symptoms and medical history.

Anyway... it lit a bit of a fire in me to dig deeper, and one day Claude suggested MCAS. I started plugging in more labs and asking Claude to cross-reference journals mentioning MCAS, and sure enough: it's MCAS.

idk what the moral of the story is, except that our current medical system is a joke. The doctors aren't the villains, but they sure aren't the heroes either.

Lihh27 yesterday at 8:53 PM

Radiology already had its "AI beats doctors" moment. Radiologists are still here. What changed first was the workflow, not the specialty. ER is probably next.

kian yesterday at 10:08 PM

But what was the overlap?

lvl155 yesterday at 8:55 PM

I’ve some family in medicine and it scares me how much they now rely on AI. Some even quote it like the Bible.

SilverElfin yesterday at 7:37 PM

I’ve had much better luck diagnosing my own family’s issues with LLMs than with doctors. Usually now I’m feeding the doctors more information to begin with, so that their 30-minute office visits are not wasted, requiring another expensive follow-up appointment.

While I’m sure there are ways in which such studies can be wrong, it’s very obvious that AI can accelerate work in many of these areas where we seek out professional help: doctors, lawyers, etc.

journal yesterday at 7:51 PM

Would it ever diagnose incorrectly to save more lives? Kinda weird that an AI would decide who dies so others may survive, but I guess whatever.

Aboutplants yesterday at 7:12 PM

Now show me the results for triage doctors aided by AI.

bluefirebrand yesterday at 7:23 PM

Unfortunately, from my understanding, doctors don't necessarily diagnose for accuracy; they often diagnose to limit liability.

They aren't going to take a stab at an uncommon diagnosis, even if it occurs to them, if they might get sued for being wrong.

Edit: I'm not saying doctors deliberately diagnose wrong. Just that if there are two possible diagnoses, one common that matches some of the symptoms and one rare that matches all of them, doctors are still much more likely to diagnose the common one. Hoofbeats, horses, zebras, etc.

wg0 yesterday at 7:33 PM

The Guardian needs to raise its bar on what it reports, and give readers full context on the ongoing NFT/AI/trust-me-bro/crypto scam: that context being that this is a mathematical model of human language, not a medical expert or a replacement for one.

Bender yesterday at 1:42 PM

Humans could not diagnose and treat me correctly. They almost killed me. I'm curious where I could feed my symptoms, and the same data I gave the ER, to an AI to test it.

taurath yesterday at 7:23 PM

I’d love to see a follow-up to that radiologist evaluation, where it failed so miserably on the thing it was supposed to be best at that there's now a shortage of radiologists.
