The Stanford AI Index: The Winning Tools in Healthcare
Stanford’s 2026 AI Index was published this morning. All 423 pages of it. And while much of the coverage over the next few days will focus on the closing U.S.-China model gap and the $285 billion in U.S. private investment, I want to talk about what the medicine chapter actually tells us, because it validates something I’ve been arguing for years, with data that most healthcare leaders haven’t seen yet.
The tools winning in clinical settings are not the most sophisticated ones. They’re the ones built around how clinicians actually work.
That distinction sounds simple. It isn’t.
Three Tools. Documented Results at Scale.
Stanford identifies three categories of clinical AI that moved from pilot to enterprise scale in 2025 with measurable outcomes: ambient AI documentation, sepsis prediction, and generative AI embedded directly in EHR workflows.
Ambient AI scribes saw the broadest adoption of any clinical AI category. Abridge expanded from approximately 100 to over 150 health systems in a single year, reaching 63% adoption among hospitals running on Epic. The outcomes were consistent across institutions in ways that are rare in clinical AI research:
• Sharp HealthCare: 83% reduction in note-writing effort
• University of Chicago Medicine: 47% reduction in cognitive load, 58% increase in undivided patient attention
• MaineHealth: 23% reduction in note time, used in 70.3% of encounters
• Northwestern Medicine: 11.3 additional patients seen per month, 24% reduction in documentation time, 112% return on investment
• Stanford Health Care: Statistically significant reductions in burnout, median 20-minute savings per half day of clinic
That kind of consistency across different institutions, different patient populations, and different market segments is not an accident. It’s a signal about construct.
Sepsis prediction produced the most clinically striking outcomes. The TREWS system, deployed across 13 Cleveland Clinic hospitals, reported an 18.7% relative reduction in sepsis mortality and an 89% clinician adoption rate. UC San Diego’s COMPOSER system reported a 17% reduction in sepsis mortality across more than 6,000 admissions, with an estimated 50 lives saved annually.
These aren’t small studies. These are enterprise deployments with documented mortality impact.
The Same Technology. Radically Different Outcomes.
Here’s where it gets important.
The Stanford-Harvard ARISE Network reviewed over 500 clinical AI studies and found that nearly half relied on exam-style questions rather than real patient data. Only 5% used real clinical data. Separately, the NOHARM benchmark found that leading LLMs — the same underlying models powering many of these tools — produced between 11.8 and 14.6 severely harmful recommendations per 100 clinical cases, with 76.6% of those errors being errors of omission: failing to recommend a critical test or intervention.
Same models. Catastrophically different results.
Stanford is explicit about why: those harmful recommendation rates apply to general-purpose LLMs deployed on open-ended clinical reasoning tasks. The tools driving real adoption — ambient scribes and sepsis alerts — operate within constrained workflows with clinician oversight. The technology doesn’t change. The construct does.
This is not a subtle point. It is the entire argument for how healthcare organizations should be evaluating AI investments right now.
Construct Determines Efficacy. Not Capability.
The tools that are winning share three design characteristics that have nothing to do with model sophistication:
1. They operate inside existing workflows, not alongside them. The ambient scribe doesn’t ask a clinician to change anything about how they conduct a patient visit. It listens. It drafts. The human reviews and signs. The workflow is preserved; the burden is removed.
2. They constrain the AI’s decision surface. Sepsis prediction doesn’t ask the AI to reason across all of medicine. It monitors a specific set of variables and surfaces an alert at a defined threshold. The AI does one thing. The clinician decides what to do about it (see the sketch after this list).
3. The human retains decision authority at every step. This isn’t “human in the loop” in the checkbox sense. It’s the human as the through-line — initiating, conferring, implementing, refining. The AI doesn’t replace clinical judgment; it protects the time and cognitive space required to exercise it.
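To make the pattern concrete, here is a minimal sketch of what a constrained decision surface with human decision authority looks like in code. It is illustrative only: the VitalsSnapshot fields, the toy scoring function, and the 0.75 threshold are assumptions invented for this example, not how TREWS or COMPOSER actually work. What matters is the shape of the design: a fixed input surface, one defined output, and a clinician action as the only path from alert to treatment.

```python
from dataclasses import dataclass
from enum import Enum


class ClinicianAction(Enum):
    TREAT = "acknowledge and treat"
    DISMISS = "dismiss with documented reason"


@dataclass(frozen=True)
class VitalsSnapshot:
    # Constrained decision surface: the model sees only these fields,
    # never the open-ended task of "reason about this patient."
    heart_rate: float        # beats per minute
    temperature_c: float     # degrees Celsius
    respiratory_rate: float  # breaths per minute
    lactate: float           # mmol/L


ALERT_THRESHOLD = 0.75  # hypothetical operating point, fixed during validation


def sepsis_risk_score(v: VitalsSnapshot) -> float:
    """Stand-in for a validated model; returns a score in [0, 1].

    A real deployment would call a trained, institution-validated model
    here. The wrapper around it is what this sketch is about.
    """
    # Toy scoring: fraction of crude warning signs present.
    flags = [
        v.heart_rate > 100,
        v.temperature_c > 38.3 or v.temperature_c < 36.0,
        v.respiratory_rate > 22,
        v.lactate > 2.0,
    ]
    return sum(flags) / len(flags)


def maybe_alert(v: VitalsSnapshot) -> bool:
    # The AI does exactly one thing: score and compare to a fixed threshold.
    return sepsis_risk_score(v) >= ALERT_THRESHOLD


def resolve_alert(action: ClinicianAction) -> str:
    # Decision authority stays with the human: the system surfaces the
    # alert and records the response; it never orders treatment itself.
    return f"Alert resolved by clinician: {action.value}"


if __name__ == "__main__":
    patient = VitalsSnapshot(heart_rate=118, temperature_c=38.9,
                             respiratory_rate=26, lactate=3.1)
    if maybe_alert(patient):
        print("Sepsis risk alert surfaced for clinician review.")
        print(resolve_alert(ClinicianAction.TREAT))
```

Note what this code cannot do: there is no branch in which the system treats the patient. Every consequential path runs through resolve_alert, which is item 3 in miniature.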
When these three conditions are met, the results are consistent. When they aren’t — when AI is deployed broadly on complex reasoning tasks without constrained scope or human oversight — the results are dangerous.
The Governance Gap Is Wider Than Most Leaders Realize
The FDA authorized 258 AI-enabled medical devices in 2025. But a peer-reviewed analysis of all authorizations through December 2024 found that only 2.4% of devices with clinical studies were supported by randomized controlled trial data. The vast majority entered the market through pathways that rely on existing evidence rather than new clinical trials.
Healthcare leaders are being asked to make AI deployment decisions on a real-world evidence base that barely exists. Most clinical AI research still doesn’t use real patient data. Most cleared devices haven’t been validated with rigorous trial methodology.
Multi-agent diagnostic systems are beginning to show extraordinary benchmark results — Microsoft’s AI Diagnostic Orchestrator scored 85.5% on complex NEJM cases versus 20% for unaided physicians. But benchmark results are not clinical outcomes, and the gap between authorization and deployment readiness is still wide.
What This Means for Your Organization
The question I hear most often from health system leaders right now is some version of: “Where do we start?”
The Stanford data gives a clear answer: start with tools that reduce burden on clinicians within workflows they already own. The evidence base is there. The ROI is documented. The adoption rates are high because clinicians don’t experience these tools as disruption — they experience them as relief.
The tools that are failing — or more precisely, the tools producing dangerous results — are the ones deployed with misplaced ambition. Asking a general-purpose AI to reason across an entire patient case is a different design challenge than asking it to transcribe a conversation or alert on a specific physiological threshold.
There’s one more finding worth sitting with. Stanford documents that patient trust in AI is not driven by the technology itself; it’s clinician-mediated, with provider endorsement as the key determinant of patient acceptance. This means that reducing burden on clinicians isn’t just an operational win. Tools that clinicians experience as relief are tools clinicians endorse, and endorsement is what builds patient trust across the care experience.
The human isn’t a checkpoint in this system. The human is the through-line that makes the whole thing work.
Read Stanford's full report here.
