The Art of Medicine is Still Human
JK
What Two New Studies Tell Us About Where Clinical AI Fails — and How to Fix It
The Stanford AI Index came out yesterday, and its medicine chapter tells us something specific: the tools winning in clinical settings aren’t the most sophisticated ones. They’re the ones designed around how clinicians actually work. Design determines efficacy. Not capability.
Two studies published this week, both peer-reviewed, both from leading academic medical centers, explain exactly why that’s true. One shows the failure at the reasoning level. The other shows the architectural fix. Together, they make the most compelling case I’ve seen for why human-centered design isn’t a nice-to-have in clinical AI. It’s the variable that determines whether these tools help or harm.
The Failure First
A team from Harvard Medical School and Mass General Brigham just published a landmark evaluation of 21 frontier LLMs — including GPT-5, Claude 4.5 Opus, Gemini 3.0, and Grok 4 — across the full arc of clinical reasoning. Differential diagnosis, diagnostic testing, final diagnosis, management, and clinical reasoning questions. Every model. Every stage.
The finding that should stop every healthcare leader cold:
Failure rates exceeded 80% on differential diagnosis across every single model tested — meaning fewer than 1 in 5 differentials were fully correct. GPT-5. Grok 4. Claude 4.5 Opus. All of them. The most capable AI systems ever built fail more than 8 times out of 10 at the task that sits at the beginning of every clinical encounter.
At the same time, final diagnosis accuracy was relatively strong — often above 85%. The same models that can’t generate a differential diagnosis reliably can identify the correct final answer when given enough information.
This looks paradoxical until you understand what’s actually happening. The authors explain it precisely: clinicians preserve uncertainty and iteratively refine differential diagnoses, whereas LLMs collapse prematurely onto single answers. As corresponding author Marc Succi, MD, stated:
“Differential diagnoses are central to clinical reasoning and underlie the art of medicine that AI cannot currently replicate.”
Clinicians know what patients feel here: uncertainty is the hardest part of waiting for a diagnosis while test results come back. The brain loops. Stress stays high. People want an answer, any answer, to make it stop. LLMs are great at giving answers. That’s the problem. They give them too early, collapsing possibilities a human would hold open. It’s not a data problem. It’s not a benchmark problem. It’s not something the next model release will solve. It’s a reasoning-architecture problem that has persisted across every generation of models tested.
One important nuance: reasoning-optimized models — GPT-o1, DeepSeek R1, Claude 4.5 Opus — significantly outperformed standard models overall. But even they failed differential diagnosis more than 70% of the time. Better, not safe. The gap narrowed; the problem didn’t go away.
LLMs are extraordinarily good at pattern recognition against known information. They are structurally weak at holding multiple competing hypotheses open simultaneously while iterating toward a conclusion under uncertainty. That gap — between what they can do and what early clinical reasoning actually requires — is where the danger lives.
The Architecture That Fixes It
A second study, from the Windreich Department of AI and Human Health at Mount Sinai, tested a lightweight orchestrator that routes each task to a dedicated worker agent handling one specific job.
The results were striking. As workload scaled, as it always does in real clinical environments, single-agent accuracy collapsed from 73% to 16.6%. Multi-agent accuracy held at 65.3% even at the highest load tested.
The same models. The same technology. A 48-percentage-point accuracy gap at scale, determined entirely by how the work was structured.
The mechanism matters. Each worker agent receives only the tokens relevant to a single decision. Attention isn’t diluted across irrelevant material. The effective context stays within the range the model was designed to handle.
Co-author Mahmud Omar put it plainly:
“When a single agent handles everything, you can’t trace where it went wrong. With the orchestrator, every step is logged — which tool was called, what it returned, and how the answer was assembled. At 80 simultaneous tasks, the single agent dropped to 16 percent accuracy while burning 65 times more compute — and you’d have no way to figure out why. That kind of transparency isn’t optional in medicine.”
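The pattern the quote describes can be sketched in a few lines. This is an illustrative mock, not the Mount Sinai team’s code: the class and task names are invented, and the worker is a stand-in for a narrowly scoped LLM call. What it shows is the two properties that matter — each worker sees only its own input, and every step lands in a replayable log.

```python
# Hypothetical sketch of the orchestrator pattern: a router dispatches each
# bounded task to a narrowly scoped worker and records every step.
# All names (Orchestrator, Step, task types) are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    task_type: str   # which tool was called
    payload: str     # what it was given
    result: str      # what it returned

class Orchestrator:
    def __init__(self):
        self.workers: dict[str, Callable[[str], str]] = {}
        self.audit_log: list[Step] = []  # replayable trail of every decision

    def register(self, task_type: str, worker: Callable[[str], str]) -> None:
        self.workers[task_type] = worker

    def run(self, task_type: str, payload: str) -> str:
        # Each worker receives only the tokens relevant to its single decision;
        # context never bleeds between tasks.
        result = self.workers[task_type](payload)
        self.audit_log.append(Step(task_type, payload, result))
        return result

# A worker here is a placeholder for a scoped call (extraction, dosing, retrieval).
orch = Orchestrator()
orch.register("extract_dose", lambda note: note.split("dose:")[1].strip())

answer = orch.run("extract_dose", "med: heparin dose: 5000 units")
```

The point of the structure, as Omar notes, is that when something goes wrong you can read the log and see exactly which tool was called, with what, and what came back.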
What These Two Studies Say Together
The Harvard/MGH paper tells you where LLMs break. They break at the beginning — at the open-ended, uncertainty-laden, iterative reasoning that clinical diagnosis actually requires. They perform well when the task is bounded and the answer is deterministic. They fail when the task requires holding ambiguity and reasoning forward.
The Mount Sinai paper tells you what to do about it. You don’t ask the AI to do the things it can’t do. You constrain the task. You partition the work. You route each decision to the right agent with the right scope. And you keep the human exactly where clinical judgment is required — not as a checkpoint at the end, but as the authority at every stage where uncertainty is real.
To be precise: the Mount Sinai team didn’t test differential diagnosis — they tested bounded tasks like retrieval, extraction, and dosing calculations. Their orchestrator architecture hasn’t been proven to fix the 80% differential diagnosis failure rate. But their architectural logic — constrain scope, isolate context, make every step auditable — is exactly what the Harvard paper suggests as the only responsible way to deploy LLMs clinically today.
This is why the ambient AI scribe works. It doesn’t diagnose. It listens and transcribes. Bounded task. Reviewable output. Clinician authority intact.
This is why the sepsis prediction tool works. One input, one output, one threshold. The AI surfaces the alert. The clinician decides.
And this is why general-purpose clinical reasoning deployments fail. They ask the AI to do the one thing the Harvard paper proves it cannot do — hold diagnostic uncertainty, generate meaningful differentials, reason iteratively toward a conclusion.
There is now direct empirical evidence for what happens when you put the human back in. A separate 2025 study of hybrid human-AI collectives — where physician diagnoses and LLM outputs are combined — found that these hybrid groups outperform both human-only and AI-only groups on differential diagnosis across multiple medical specialties. The reason is simple: when LLMs fail, physicians often supply the correct diagnosis. The human isn’t decorative oversight. The human is an active accuracy contributor, a participant rather than a supervisor, at exactly the point where AI is structurally weakest.
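To make the idea of a hybrid collective concrete, here is a minimal sketch of one way ranked differentials from several contributors could be pooled. This is not the study’s actual method; the Borda-style scoring and the example diagnoses are assumptions for illustration only.

```python
# Illustrative only: combine ranked differential lists from physicians and an
# LLM with a simple Borda-style score, so a diagnosis ranked highly by several
# contributors rises to the top. Not the published study's aggregation method.
from collections import defaultdict

def aggregate_differentials(ranked_lists: list[list[str]]) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for position, dx in enumerate(ranking):
            # Earlier positions earn more points.
            scores[dx] += len(ranking) - position
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical inputs: two physician differentials and one LLM differential.
physician_a = ["pulmonary embolism", "pneumonia", "pericarditis"]
physician_b = ["pneumonia", "pulmonary embolism", "heart failure"]
llm_output  = ["pneumonia", "bronchitis", "pulmonary embolism"]

combined = aggregate_differentials([physician_a, physician_b, llm_output])
```

The mechanism the study points to shows up even in this toy: a diagnosis the LLM misses entirely can still surface near the top if the physicians rank it highly, and vice versa.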
The Design Principle Underneath Both Papers
Neither research team frames their findings as a design argument. But that’s what they are.
The tools that work started with the human — with how clinicians actually move through a patient encounter, where their cognitive load peaks, what decision authority they need to retain, what burden they’d give anything to put down. The technology was shaped around that reality.
The tools that fail started with the capability — with what the model can theoretically do, and the assumption that clinical value would follow from technical performance. It doesn’t.
The Mount Sinai paper’s conclusion is a design specification: constrain the task, partition the work, keep the context bounded, make every step auditable. That’s not an AI architecture document. That’s a human-centered design document written in the language of LLM infrastructure.
The Harvard paper’s conclusion is a warning and a prescription in the same sentence: their most responsible role today is targeted, clinician-supervised use in low-uncertainty tasks. The researchers who just tested every major frontier model are telling us these systems should only be deployed on tasks where uncertainty is low and a clinician is supervising.
That’s a narrow runway. Most of what vendors are currently marketing doesn’t fit inside it.
What This Means for Healthcare Leaders Right Now
The question isn’t whether AI belongs in clinical settings. It does, and the evidence for specific, well-designed tools is now robust. The question is whether the tools you’re evaluating were designed with these constraints in mind or in spite of them.
What is the specific task this tool is performing? The more bounded and deterministic, the stronger the evidence base. The more open-ended and uncertainty-laden, the more caution is warranted, regardless of benchmark performance.
Where does the human retain decision authority? Not as a final reviewer, but as an active participant at every stage where clinical uncertainty is real. Tools designed this way have produced 112% ROI. Tools that aren’t produce fewer than 1 in 5 correct differentials.
Can every intermediate step be audited? The Mount Sinai architecture produces a logged, replayable audit trail by design. Most single-agent deployments don’t. In a regulatory environment where AI incidents rose 55% in a single year, that difference is not academic.
The technology is extraordinary. The failures are predictable. The gap between them is design.
Sources:
Rao AS et al. Large Language Model Performance and Clinical Reasoning Tasks. JAMA Network Open. 2026;9(4):e264003. April 13, 2026.
Klang E, Omar M, et al. Orchestrated Multi-Agents Sustain Accuracy Under Clinical-Scale Workloads Compared to a Single Agent. npj Health Systems. 2026;3:23.
Zöller N et al. Human-AI Collectives Produce the Most Accurate Differential Diagnoses. arXiv. 2406.14981. 2025.
2026 AI Index Report, Stanford Institute for Human-Centered Artificial Intelligence (HAI). April 13, 2026.
