Large Language Model Influence on Management Reasoning: A Randomized Controlled Trial

Ethan Goh, Robert Gallo, Eric Strong, Yingjie Weng, Hannah Kerman, Jason Freed, Josephine A Cool, Zahir Kanjee, Kathleen Lane, Andrew S Parsons, Neera Ahuja, Eric Horvitz, Daniel Yang, Arnold Milstein, Andrew PJ Olson, Jason Hom, Jonathan H. Chen, Adam Rodman
DOI: https://doi.org/10.1101/2024.08.05.24311485
2024-08-07
Abstract:
Importance: Large language model (LLM) artificial intelligence (AI) systems have shown promise in diagnostic reasoning, but their utility in management reasoning with no clear right answers is unknown.
Objective: To determine whether LLM assistance improves physician performance on open-ended management reasoning tasks compared to conventional resources.
Design: Prospective, randomized controlled trial conducted from 30 November 2023 to 21 April 2024.
Setting: Multi-institutional study from Stanford University, Beth Israel Deaconess Medical Center, and the University of Virginia involving physicians from across the United States.
Participants: 92 practicing attending physicians and residents with training in internal medicine, family medicine, or emergency medicine.
Intervention: Five expert-developed clinical case vignettes were presented with multiple open-ended management questions and scoring rubrics created through a Delphi process. Physicians were randomized to use either GPT-4 via ChatGPT Plus in addition to conventional resources (e.g., UpToDate, Google), or conventional resources alone.
Main Outcomes and Measures: The primary outcome was difference in total score between groups on expert-developed scoring rubrics. Secondary outcomes included domain-specific scores and time spent per case.
Results: Physicians using the LLM scored higher compared to those using conventional resources (mean difference 6.5%, 95% CI 2.7-10.2, p<0.001). Significant improvements were seen in management decisions (6.1%, 95% CI 2.5-9.7, p=0.001), diagnostic decisions (12.1%, 95% CI 3.1-21.0, p=0.009), and case-specific (6.2%, 95% CI 2.4-9.9, p=0.002) domains. GPT-4 users spent more time per case (mean difference 119.3 seconds, 95% CI 17.4-221.2, p=0.02). There was no significant difference between GPT-4-augmented physicians and GPT-4 alone (-0.9%, 95% CI -9.0 to 7.2, p=0.8).
Conclusions and Relevance: LLM assistance improved physician management reasoning compared to conventional resources, with particular gains in contextual and patient-specific decision-making. These findings indicate that LLMs can augment management decision-making in complex cases.
Trial Registration: ClinicalTrials.gov Identifier: ; https://classic.clinicaltrials.gov/ct2/show/
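
The reported intervals and p-values can be cross-checked with a rough back-calculation. The sketch below assumes a symmetric normal-approximation (Wald) interval, which is an assumption made here for illustration and not the analysis described in the abstract; the function name `ci_to_p` is hypothetical. It recovers the midpoint, an implied standard error, and a two-sided p-value from a 95% CI, so small discrepancies from the published values are expected.

```python
from statistics import NormalDist


def ci_to_p(lower, upper, level=0.95):
    """Back out midpoint, standard error, and two-sided p-value from a CI.

    Assumes a symmetric normal-approximation interval (an assumption of
    this sketch, not a description of the trial's statistical model).
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    estimate = (lower + upper) / 2          # midpoint of the interval
    se = (upper - lower) / (2 * z_crit)     # implied standard error
    z = estimate / se
    p = 2 * (1 - nd.cdf(abs(z)))            # two-sided p-value
    return estimate, se, p


# Intervals as reported in the abstract (percentage points or seconds).
reported = {
    "total score": (2.7, 10.2),
    "time per case (s)": (17.4, 221.2),
    "physician + GPT-4 vs GPT-4 alone": (-9.0, 7.2),
}

for name, (lo, hi) in reported.items():
    est, se, p = ci_to_p(lo, hi)
    print(f"{name}: estimate={est:.2f}, SE={se:.2f}, p={p:.4f}")
```

Under this approximation the total-score interval implies a p-value below 0.001 and the comparison with GPT-4 alone implies a p-value well above 0.05, in line with the reported results.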
Health Informatics