EvidenceAtlas v0.2 — Last updated 2026-02-20

Living EvidenceAtlas: Ambient AI Scribes

This chapter is actively maintained as new evidence emerges. It serves as a learning resource for HCHE 336: Evidence-Based AI in Life Sciences & Healthcare at Morehouse College, and is shared publicly to support evidence-based decision-making across the health AI community.

Executive Summary

Ambient AI scribes — tools that listen to clinician-patient conversations and automatically generate clinical notes — are among the fastest-adopted AI technologies in healthcare. But what does the evidence actually say about how well they work?

We reviewed six studies representing the current evidence base, including the first-ever randomized controlled trial of an ambient AI scribe. Here's what we found.

Burnout Reduction: Strong Evidence

The headline finding is real. The first randomized controlled trial of an ambient AI scribe found a significant reduction in work exhaustion and interpersonal disengagement among 66 practitioners over 24 weeks, with a number needed to treat of less than two [1]. A separate pre-post study across six U.S. health systems corroborates this, showing burnout prevalence dropping from 52% to 39% among 186 clinicians after 30 days [2]. The convergence across two different study designs, using different measurement instruments, strengthens confidence that ambient scribes meaningfully reduce clinician burnout.

An important nuance: the benefit is specific to reducing exhaustion, not increasing fulfillment. Professional fulfillment — the other half of clinician well-being — did not significantly improve in the RCT [1]. Ambient scribes appear to relieve what's draining clinicians, but they don't independently create what energizes them.

Time Savings: Moderate to Strong Evidence

Objective EHR audit log data from the RCT shows clinicians spent approximately 22 fewer minutes per day on notes after adopting an ambient scribe [1]. This is substantially more than earlier syntheses suggested. A companion RCT testing two different products found that one (Nabla) reduced time-in-note by 9.5% compared to the control group (significant), while the other (DAX Copilot) showed only a 1.7% reduction (not significant) [3]. Self-reported data from the pre-post study showed a reduction of approximately 11 minutes per workday in after-hours documentation time across six health systems [2].

The key takeaway: time savings are real but vary dramatically by product. The difference between 22 minutes per day and essentially no improvement over control is the difference between a meaningful operational gain and a negligible one.

Note Quality and Accuracy: Moderate Evidence

AI-generated notes scored high across multiple quality dimensions in the RCT: accuracy (4.44/5), thoroughness (4.57/5), usefulness (4.83/5), and linguistic quality (4.63–4.99/5) [1]. The weakest area was abstraction and synthesis (the ability to integrate information and draw clinical inferences), which scored 3.97/5. Stigmatizing language appeared in less than 1% of notes.

Separately, a study comparing LLM-generated discharge summaries to physician-written ones found comparable overall quality (3.7 vs. 3.8 on a 5-point scale), with AI outputs being more concise and coherent but less comprehensive and containing more errors per summary (2.9 vs. 1.8) [4]. These findings support a "draft-plus-review" model: AI generates, clinicians verify.

Billing Compliance: Strong Evidence (Single Study)

In a head-to-head comparison of over 6,000 notes, AI-generated notes showed significantly better ICD-10 billing compliance than human-authored notes (6.87 vs. 5.94 on a 0–10 scale) [1]. This is a potentially meaningful financial lever: more accurate coding can reduce claim denials and capture revenue more completely. However, this finding comes from a single study and awaits replication.

Patient Volume and Capacity: No Evidence of Impact

Neither the RCT nor the pre-post study found that ambient scribes increased the number of patients clinicians could see [1][2]. Perceived control over scheduling also did not improve [1]. Ambient scribes improve the experience of documentation work, but they do not expand clinical throughput.

Safety and Quality Assurance: Emerging Evidence

An automated fact-checking tool (VeriFact) demonstrated that AI can verify claims in clinical notes against the patient's medical record, achieving 93% agreement with clinician consensus, actually exceeding clinicians' own inter-rater agreement on AI-generated text (93% vs. 89%) [5]. However, that headline number is partly inflated by the fact that most claims in clinical notes are correct (class imbalance); the tool's ability to discriminate between correct and incorrect claims is more moderate. Importantly, VeriFact can only check whether what's written is supported by the record. It cannot detect what's missing from a note. It was also tested on discharge summaries rather than ambient scribe notes specifically. Still, it suggests that automated quality assurance for AI-generated clinical text is a feasible direction.

Explore what this evidence means for your context

Use our interactive EvidenceSimulator to see how these findings apply to your product choice, baseline burnout rate, implementation approach, and practice setting.

Detailed Evidence Review

1. Clinician Burnout and Well-Being

This is the most thoroughly studied area. Three studies directly measured burnout-related outcomes.

The RCT (Afshar et al., NEJM AI, 2025) [1] is the strongest evidence available. In a 24-week stepped-wedge individually randomized pragmatic trial at an academic health system in Wisconsin and Illinois, 66 healthcare practitioners were randomized to begin using an ambient AI scribe (Abridge) at staggered intervals. The study used the Stanford Professional Fulfillment Index (PFI), a validated burnout instrument.

Work exhaustion and interpersonal disengagement decreased by 0.44 points on a 5-point scale (P<0.001). At baseline, 56% of practitioners scored above the high burnout threshold; by the end of the study, this dropped to 35%, a 21-percentage-point reduction. The number needed to treat was 1.68, meaning that for roughly every two practitioners given access to the scribe, one experienced a clinically meaningful reduction in burnout.
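
For readers less familiar with these metrics, the sketch below shows how a number needed to treat (NNT) relates to an absolute risk difference. It reuses the prevalence figures quoted above purely for illustration; it is not a re-analysis of the trial.

    # Illustrative arithmetic only; not a re-analysis of the trial data.
    # NNT = 1 / absolute risk reduction (ARR), where ARR is the difference in the
    # proportion of practitioners experiencing the outcome of interest.

    def absolute_risk_reduction(control_rate: float, treated_rate: float) -> float:
        return control_rate - treated_rate

    def number_needed_to_treat(arr: float) -> float:
        return 1.0 / arr

    # Using the high-burnout prevalences quoted above (56% at baseline, 35% at the end):
    arr = absolute_risk_reduction(0.56, 0.35)        # 0.21
    print(round(number_needed_to_treat(arr), 1))     # ~4.8

    # The trial's reported NNT of 1.68 is defined against a different event
    # (a clinically meaningful within-person reduction in burnout), which by the
    # same formula corresponds to a risk difference of roughly 1 / 1.68, or about 0.60.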

Critically, professional fulfillment (the positive dimension of well-being) increased by only 0.14 points and did not reach statistical significance after correction for multiple comparisons (P=0.04 vs. the required threshold of P<0.025) [1]. The authors note that professional fulfillment may be influenced by broader organizational factors like staffing levels, professional autonomy, and team climate that scribes do not address.

Multiple secondary measures also improved significantly: task load decreased, impact on personal relationships improved, perceived clinical efficiency increased, meaningfulness of work rose, and trust in AI grew. One secondary measure did not improve: perceived control over work schedule (P=0.382) [1].

This study had notable methodological strengths: no vendor funding (funded by the university hospital and NIH), all authors affiliated with the university rather than the vendor, preregistered on ClinicalTrials.gov, and analytic code publicly available.

The pre-post study (Olson et al., JAMA Network Open, 2025) [2] examined 263 clinicians across six academic and community-based U.S. health systems who used an ambient AI scribe (Abridge) for 30 days. Using a validated single-item burnout measure, burnout prevalence decreased from 52% to 39% among the 186 participants included in the burnout analysis, a 13-percentage-point drop, with 74% lower odds of burnout after adjustment.
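
Because odds and prevalence are easy to conflate, the short sketch below works through the crude (unadjusted) odds-ratio arithmetic using the prevalences quoted above. The study's 74% figure comes from an adjusted model, so it will not match this back-of-the-envelope number; the sketch is offered only to clarify the terminology.

    # Crude odds-ratio arithmetic for illustration only; the paper's "74% lower
    # odds" is an adjusted estimate and will differ from this unadjusted figure.

    def odds(prevalence: float) -> float:
        return prevalence / (1 - prevalence)

    odds_before = odds(0.52)             # ~1.08
    odds_after = odds(0.39)              # ~0.64
    crude_or = odds_after / odds_before  # ~0.59, i.e. roughly 41% lower unadjusted odds
    print(round(crude_or, 2))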

Additional findings included a large reduction in cognitive task load (2.64 points on a 10-point scale), improved focused attention on patients (2.05 points on a 10-point scale), and a self-reported decrease of approximately 11 minutes per workday in after-hours documentation time [2].

This study had broader reach (six health systems, 263 participants) but a weaker design (pre-post with no control group, non-anonymous survey) and notable vendor involvement (three of eleven authors had affiliations with the ambient scribe company, including the company's chief clinical officer during the study period). The analysis was performed independently by the Yale-affiliated authors, and the study was preregistered. These design limitations and conflicts of interest don't invalidate the findings, but they do mean this study should be interpreted alongside, not as a substitute for, the independent RCT evidence.

Notably, the pre-post study also found that ambient scribe use was not associated with a significant increase in the number of patients clinicians could add to their schedules (P=0.58) [2]. This was the only nonsignificant secondary outcome, and one that was not highlighted in earlier syntheses of these data.

The companion RCT (Lukac et al., NEJM AI, 2025) [3] tested two different products (Microsoft DAX Copilot and Nabla) in a three-arm parallel trial at an academic health system in California (238 outpatient physicians, 14 specialties). This trial was powered for documentation efficiency, not burnout, but its secondary burnout outcomes tell an interesting story. The DAX arm showed a work exhaustion reduction whose confidence interval excluded zero (PFI-WE −0.32, CI −0.55 to −0.08), and both products showed Mini-Z composite burnout improvements whose CIs excluded zero. However, as secondary outcomes from a trial not powered for these measures, these are suggestive signals that need confirmation, not proof of effect.

Utilization was approximately 30% for both products, less than half of what the Afshar RCT achieved (71%) [1], which may partly explain the weaker primary outcome results. The gap between studies highlights that implementation approach matters as much as the technology itself.

Evidence strength: Strong for burnout/exhaustion reduction with one product in one setting (RCT-confirmed, corroborated by pre-post study). Suggestive but unconfirmed for other products (secondary outcomes with CIs excluding zero, but trial not powered for burnout). Not significant for professional fulfillment in any study.

2. Documentation Time Savings

Time savings are real but more nuanced than headlines suggest.

Objective EHR data from the RCT [1] shows time on notes decreased by 0.36 hours per day (approximately 22 minutes) as measured through EHR audit logs, providing objective data rather than clinician self-report. This finding was robust across sensitivity analyses. Work done outside of normal hours also appeared to decrease (0.50 hours/day), but this finding was sensitive to statistical outliers and became nonsignificant when the top 3% of extreme values were excluded.

The companion RCT [3] measured time-in-note as a between-group percent change versus a usual-care control arm. Nabla showed a significant 9.5% reduction (P=0.02); DAX Copilot showed a nonsignificant 1.7% reduction (P=0.66). The raw within-arm time declines were 41 seconds (Nabla) and 18 seconds (DAX), but the control group also declined by 23 seconds over the same period, meaning DAX's incremental effect over doing nothing was negligible. Utilization was approximately 30% for both products, compared to 71% in the Afshar trial, which may partly explain the weaker results.
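
To make the control-arm comparison concrete, the sketch below redoes the arithmetic with the per-note figures quoted above. The trial itself modeled percent change, so this is a simplification offered for intuition, not a reproduction of its analysis.

    # Simplified difference-vs-control arithmetic using the seconds-per-note
    # figures quoted above; intuition only, not the trial's statistical model.

    control_change = -23   # seconds per note, usual-care arm
    nabla_change = -41
    dax_change = -18

    print("Nabla vs control:", nabla_change - control_change)  # -18 s: a real incremental saving
    print("DAX vs control:  ", dax_change - control_change)    # +5 s: essentially none beyond control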

Self-reported data from the pre-post study [2] showed after-hours documentation time decreasing by approximately 11 minutes per workday (10.8 min/day), along with additional reductions in total documentation time and EHR queue time. These self-reported figures are directionally consistent with the Afshar RCT's objective measurement of 22 minutes per day, though the 11-minute figure captures only one slice of documentation work (after-hours time). The convergence across studies and measurement methods strengthens the overall time savings signal for this product.

An important caveat from both the RCT and the pre-post study: same-day encounter closure rates and follow-up timeliness did not significantly improve [1]. Documenting the clinical encounter is only a fraction of total EHR time, and both studies note that further productivity gains may depend on expanding AI capabilities beyond note generation to include downstream tasks like inbox management, order entry, and patient communication [1][2][6].

Evidence strength: Strong for daily documentation time reduction (~22 min/day from EHR audit log in one RCT). Moderate for per-note savings (product-dependent). Limited for after-hours work reduction (sensitive to outliers).

3. Note Quality and Documentation Accuracy

The RCT [1] evaluated nearly 8,000 AI-generated ambient scribe notes using the PDSQI-9, a validated note quality instrument scored by an LLM-as-judge system that was itself validated against physician ratings. Quality was high across most domains on a 5-point scale: accuracy (including falsification and fabrication) scored 4.44, thoroughness scored 4.57, usefulness scored 4.83, and linguistic quality dimensions (organization, comprehensibility, succinctness) ranged from 4.63 to 4.99.

The lowest-scoring domain was abstraction and synthesis at 3.97, reflecting the difficulty of integrating information from multiple sources and drawing clinical inferences. This makes intuitive sense: ambient scribes are fundamentally transcription-to-structure tools, and higher-order clinical reasoning remains harder for AI.

Stigmatizing language was detected in fewer than 1% of notes (12 out of 7,966). No quality drift was detected across 15 sequential monthly monitoring windows, and a planned software update during the study had no significant impact on note quality or utilization [1].

A separate study on AI-generated discharge summaries [4] compared LLM-generated (GPT-4) and physician-written narratives across 100 inpatient encounters. Blinded reviewers found comparable overall quality (3.7 vs. 3.8). AI outputs were rated more concise and coherent, but less comprehensive, and contained more errors per summary (2.9 vs. 1.8). Overall harmfulness scores remained low for both. This supports treating AI-generated clinical text as a draft requiring clinician review rather than a finished product.

Evidence strength: Moderate to strong for overall note quality (single RCT with large note sample and validated instrument). Moderate for error rates (AI errors slightly higher than human but low-harm). Limited for long-term quality stability (24-week window only).

4. Billing and Coding Compliance

The RCT [1] included a billing compliance analysis that compared ICD-10 coding alignment in over 6,000 notes (roughly half AI-generated, half human-authored), adjudicated by professional staff coders using a standardized rubric. AI-generated notes scored significantly higher: 6.87 vs. 5.94 on a 0–10 compliance scale (P<0.001).

This likely reflects the fact that ambient scribes generate more complete and structured documentation of clinical encounters, which in turn supports more accurate coding. This is a quantifiable financial benefit: improved coding accuracy can reduce claim denials and capture revenue that would otherwise be left on the table.

No study has yet measured the direct revenue impact of this improved coding, and the finding comes from a single health system. But it represents the first RCT-level evidence of a billing compliance advantage for AI-generated clinical notes.

Evidence strength: Strong within a single study (RCT, objective adjudication, large sample). Limited in breadth (single site, awaiting replication).

5. Utilization and Implementation

Utilization (how often clinicians actually use the scribe once it's available) varied dramatically across studies, and this may be the most underappreciated finding in the evidence base.

The first RCT [1] achieved 71% utilization (note-weighted mean), with over 27,000 of 71,000 notes generated using AI. Seven of 66 practitioners showed low fidelity: three never initiated and four used the tool inconsistently. The study used a voluntary enrollment model where practitioners had to be willing to adopt the technology.

The companion RCT [3] achieved approximately 30% utilization, less than half of the first study, using different products. The lower utilization may partly explain the weaker time savings and burnout findings.

The pre-post study [2] reported that 95.9% of participants generated at least five notes using the scribe, but did not report ongoing utilization rates. Prior documentation methods among participants were diverse: 83% used manual typing, 85% used templates, 47% used dictation, and 16% had used human scribes.

The gap between 71% and 30% utilization is substantial and suggests that implementation approach, product experience, and organizational support matter as much as the underlying technology. Satisfaction surveys do not predict actual utilization; a health system where clinicians say they like the tool may still see low adoption in practice.

Evidence strength: Moderate (utilization varies widely across studies; implementation factors appear to drive adoption as much as product quality).

6. Patient Experience and Outcomes

This is the thinnest area of the evidence base. No study has directly measured patient outcomes or patient-reported experience with ambient AI scribes.

The pre-post study [2] found a small but statistically significant improvement in clinician-reported patient comprehension of care plans (0.44 points on a 10-point scale, P=0.005) and a modest improvement in urgent patient access (0.51 points, P=0.02). Both are clinician perceptions, not patient-reported measures.

The RCT [1] explicitly did not collect patient-reported outcomes, which the authors acknowledge as a limitation.

Patient consent rates were very high in the RCT (99.92%), suggesting patients are generally comfortable with ambient recording [1]. But comfort with the technology does not tell us whether it improves or changes the care they receive.

Evidence strength: Limited. Patient-reported outcomes and clinical outcomes are the most significant gap in the ambient AI scribe evidence base.

7. Safety Considerations

Error rates in AI-generated clinical text are slightly higher than in physician-written text. The discharge summary study [4] found 2.9 errors per AI summary vs. 1.8 per physician summary, though overall harmfulness was low. The RCT's note quality data [1] showed high accuracy (4.44/5) but the abstraction/synthesis weakness (3.97/5) suggests that errors of omission or oversimplification may be the primary risk.

The companion RCT reported the first adverse safety event in the ambient scribe RCT literature [3]: a grade-1 (mild) event in which extensive patient counseling was omitted from the AI-generated assessment and patient instructions. Clinicians in that study also rated inaccuracies as "occasional" (approximately 2.7–2.8 on a 5-point frequency scale). This is a concrete example of the omission risk: the scribe didn't fabricate information, but it failed to capture something clinically important that happened during the visit.

Automated fact-checking is emerging but has important limitations. The VeriFact system [5] demonstrated that individual claims in clinical notes can be verified against the patient's medical record using a retrieval-augmented LLM pipeline. Its best configuration achieved 93.2% agreement with clinician consensus on AI-generated claims, exceeding the 88.5% agreement clinicians achieved with each other. The system labels each claim as supported, not supported, or not addressed by the medical record. All models used are open-source and locally hosted, meaning no patient data leaves the institution.
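
To make "retrieval-augmented LLM pipeline" concrete, here is a minimal sketch of how a claim checker of this kind could be structured. The function names, the keyword-overlap retriever, and the prompt are illustrative assumptions for teaching purposes, not the published VeriFact implementation.

    # Minimal sketch of a retrieval-augmented claim checker, in the spirit of the
    # system described above. Names, retriever, and prompt are illustrative
    # assumptions, not the published implementation.
    from typing import Callable, Dict, List

    LABELS = ("Supported", "Not Supported", "Not Addressed")

    def retrieve_context(claim: str, record_snippets: List[str], k: int = 5) -> List[str]:
        """Return the k record snippets sharing the most words with the claim.
        A production system would use embedding-based retrieval, not keyword overlap."""
        claim_words = set(claim.lower().split())
        ranked = sorted(record_snippets,
                        key=lambda s: len(claim_words & set(s.lower().split())),
                        reverse=True)
        return ranked[:k]

    def judge_claim(claim: str, evidence: List[str], llm: Callable[[str], str]) -> str:
        """Ask a locally hosted LLM to label one claim against the retrieved evidence."""
        prompt = (
            "Excerpts from the patient's record:\n"
            + "\n".join(f"- {e}" for e in evidence)
            + "\n\nLabel the following statement as Supported, Not Supported, "
            + f"or Not Addressed by the record:\n{claim}"
        )
        label = llm(prompt).strip()
        return label if label in LABELS else "Not Addressed"

    def verify_note(claims: List[str], record_snippets: List[str], llm: Callable[[str], str]) -> Dict[str, str]:
        """Label every claim extracted from a note against the patient's record."""
        return {c: judge_claim(c, retrieve_context(c, record_snippets), llm) for c in claims}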

However, the 93% headline requires context. Most claims in clinical notes are correct ("Supported"), so the high agreement rate is partly driven by class imbalance. The Matthews correlation coefficient (a metric less affected by class imbalance) was 0.31, indicating only moderate discriminative ability. Agreement on "Not Supported" and "Not Addressed" categories was much lower (31–34%), even among clinicians evaluating the same claims. In other words, the system is good at confirming that correct facts are correct, but substantially less reliable at catching errors.
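
The class-imbalance point is easy to demonstrate with an invented confusion matrix. The numbers below are hypothetical, chosen only to show how a 93% raw agreement rate can coexist with modest error detection; they are not the study's data.

    # Hypothetical illustration of how class imbalance inflates raw agreement.
    # These labels are invented for teaching; they are not the study's data.
    from sklearn.metrics import accuracy_score, matthews_corrcoef

    # 100 claims: 90 genuinely supported, 10 genuinely not supported.
    y_true = [1] * 90 + [0] * 10
    # A checker that confirms every supported claim but catches only 3 of the
    # 10 unsupported ones still agrees with the reference labels 93% of the time.
    y_pred = [1] * 90 + [0, 0, 0] + [1] * 7

    print(accuracy_score(y_true, y_pred))     # 0.93: looks excellent
    print(matthews_corrcoef(y_true, y_pred))  # ~0.53: already far more modest; the published
                                              # MCC of 0.31 reflects even weaker error detection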

A critical limitation for ambient scribe use: VeriFact can only verify whether what IS written in a note is supported by the medical record. It cannot detect errors of omission (content that should be in the note but isn't). For ambient scribes, where a key risk is the scribe failing to capture something the clinician said or did, this is a significant blind spot.

VeriFact was also tested on discharge summaries rather than ambient scribe encounter notes, a different document type generated in a different context. The authors suggest that fact-checking clinical notes could become analogous to spell-checking, but the current evidence supports this as a complementary safety layer, not a complete one.

The prediction that over 90% of clinical note text will be AI-generated in 2026 [6] raises a governance question: if notes are overwhelmingly AI-authored, do they still faithfully represent what the clinician actually observed, thought, and decided? This is an unresolved medicolegal and clinical quality concern that health systems should address proactively through attestation policies and documentation review requirements.

Evidence strength: Moderate for error rates (consistent across studies: AI errors slightly higher, severity low). Moderate for automated fact-checking feasibility (single study, promising results). Limited for long-term safety in production deployments.

Limitations of the Current Evidence

Transparency about what we don't know is as important as what we do.

The strongest results come from one product. The RCT [1] and the pre-post study [2] both evaluated Abridge exclusively. The companion RCT [3] tested DAX Copilot and Nabla in a head-to-head design and found substantially weaker time savings (DAX's effect was negligible over control) and unconfirmed burnout signals. Whether Abridge's positive findings generalize to other products is a significant open question, and the companion RCT suggests they may not.

The strongest study is from a single academic site. The RCT was conducted at one health system in Wisconsin and Illinois, with a practitioner population that was 89% white, 79% female, and 46% family medicine [1]. Whether these results hold in community hospitals, safety-net systems, rural settings, or among practitioners with different demographics and practice patterns is unknown.

Longest follow-up is 24 weeks. The RCT showed sustained effects across four time points over 24 weeks [1], which is encouraging. But whether burnout reduction and quality gains persist beyond one year is untested. Novelty effects and expectancy bias (the study was open-label) cannot be fully excluded.

No patient outcomes have been measured. We know ambient scribes reduce clinician exhaustion and documentation time. We do not know whether this translates to better clinical decisions, fewer errors, improved patient satisfaction, or better health outcomes. This is the largest gap in the evidence.

Sample sizes are small for a technology deployed at scale. The RCT studied 66 practitioners; the pre-post study analyzed 263. Meanwhile, ambient scribes are being deployed across health systems with thousands of clinicians. The gap between evidence scale and deployment scale is notable.

All studies are from U.S. health systems, and the strongest evidence comes from academic settings. Generalizability to non-U.S. settings, safety-net systems, rural practices, and most community settings is unproven. The companion RCT [3] covered 14 specialties (broader than the first RCT's predominantly family medicine population), but the field still lacks evidence from the full range of clinical settings where ambient scribes are being deployed.

The discharge summary and fact-checking studies used different AI systems. The LLM discharge summary comparison [4] used GPT-4 in a standalone setting, not an ambient scribe product. VeriFact [5] was tested on discharge summaries, not on real-time scribe-generated encounter notes. Extrapolating these findings to ambient scribe workflows requires caution.

The Six Studies Behind This Atlas

  [1] Afshar M, Baumann MR, Resnik F, et al. A Pragmatic Randomized Controlled Trial of Ambient Artificial Intelligence to Improve Health Practitioner Well-Being. NEJM AI. 2025;2(12):AIoa2500945. doi:10.1056/AIoa2500945. ClinicalTrials.gov: NCT06517082.
  [2] Olson KD, Meeker D, Troup M, et al. Use of Ambient AI Scribes to Reduce Administrative Burden and Professional Burnout. JAMA Network Open. 2025;8(10):e2534976. doi:10.1001/jamanetworkopen.2025.34976.
  [3] Lukac PJ, Turner W, Vangala S, et al. Ambient AI Scribes in Clinical Practice: A Randomized Trial. NEJM AI. 2025;2(12):AIoa2501000. doi:10.1056/AIoa2501000.
  [4] Williams CYK, Subramanian CR, Ali SS, et al. Physician- and Large Language Model–Generated Hospital Discharge Summaries. JAMA Intern Med. 2025;185(7):818-825. doi:10.1001/jamainternmed.2025.0821.
  [5] Chung P, Swaminathan A, Goodell AJ, et al. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI. 2025;3(1). doi:10.1056/AIdbp2500418.
  [6] Brodeur P, Goh E, Rodman A, Chen JH, et al. State of Clinical AI Report 2026. January 2026. arise-ai.org.

This atlas was developed through the EvidenceCycle methodology at Chief Health AI, as a learning aid for students in HCHE 336: Evidence-Based AI in Life Sciences & Healthcare at Morehouse College.

Know of a study we're missing? We want to hear about it. Comment on our LinkedIn post or reach out directly. If your organization is evaluating ambient AI scribes and needs the full EvidenceCycle (including deployment thresholds, testing frameworks, and monitoring signals), learn more about our programs.

Suggested citation: Chief Health AI. Living EvidenceAtlas: Ambient AI Scribes. 2026-02-20. chiefhealthai.com.

This atlas reflects Chief Health AI's independent analysis of the published evidence as of 2026-02-20. It is not medical or legal advice. For the full evidence review, contact Chief Health AI.