Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, Le Hou, Yong Cheng, Yun Liu, S Sara Mahdavi, Sushant Prakash, Anupam Pathak, Christopher Semturs, Shwetak Patel, Dale R Webster, Ewa Dominowska, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Jake Sunshine, Alan Karthikesalingam, Vivek Natarajan

Abstract: An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning and evaluate its ability to generate a DDx alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: either assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx prior to using the respective assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs. 33.6%, p = 0.04). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) than for clinicians without its assistance (36.1%; McNemar's test: 45.7, p < 0.01) and for clinicians with search (44.4%; McNemar's test: 4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has the potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
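
The abstract reports comparisons of top-10 accuracy using McNemar's test. As a rough illustration of how such a paired comparison can be computed, the sketch below takes per-case hit/miss outcomes under two conditions and returns the McNemar statistic with its chi-square p-value (1 degree of freedom). The function name `mcnemar_from_pairs`, the optional continuity correction, and the sample outcomes are assumptions made for illustration; they are not taken from the study, and the abstract does not state which variant of the test was used.

```python
import math


def mcnemar_from_pairs(hits_a, hits_b, continuity=False):
    """Return (statistic, p_value) for paired boolean outcomes.

    hits_a[i] and hits_b[i] say whether the correct diagnosis appeared in the
    top-10 list for case i under condition A and condition B respectively.
    Only discordant pairs (one condition right, the other wrong) contribute.
    """
    if len(hits_a) != len(hits_b):
        raise ValueError("paired outcome lists must have the same length")

    # b: A correct, B wrong; c: A wrong, B correct
    b = sum(1 for a, o in zip(hits_a, hits_b) if a and not o)
    c = sum(1 for a, o in zip(hits_a, hits_b) if not a and o)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference

    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # Edwards' continuity correction (optional)
    statistic = diff ** 2 / (b + c)

    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical outcomes for six cases (illustrative only, not study data)
assisted = [True, True, True, True, False, False]
unassisted = [False, False, False, True, True, False]
stat, p = mcnemar_from_pairs(assisted, unassisted)
print(f"statistic={stat:.2f}, p={p:.4f}")
```

With so few discordant pairs an exact binomial version of the test would normally be preferred; the chi-square form is shown here only because the abstract reports a test statistic alongside each p-value.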

Conversational Disease Diagnosis via External Planner-Controlled Large Language Models

Automatic Interactive Evaluation for Large Language Models with State Aware Patient Simulator

AI Hospital: Interactive Evaluation and Collaboration of LLMs As Intern Doctors for Clinical Diagnosis

Testing the Limits of Language Models: A Conversational Framework for Medical AI Assessment

Large Language Models for Disease Diagnosis: A Scoping Review

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Large Language Models Illuminate a Progressive Pathway to Artificial Intelligent Healthcare Assistant

Towards Conversational Diagnostic AI

Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback

LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation

AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator

Leveraging Large Language Model as Simulated Patients for Clinical Education

Large Language Models as Agents in the Clinic

Enhancing Clinical Accuracy of Medical Chatbots with Large Language Models

Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis

Building Conversational Diagnosis Systems for Fine-Grained Diseases Using Few Annotated Data

An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

Evaluating large language models as agents in the clinic

ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models

Towards Accurate Differential Diagnosis with Large Language Models

Large Language Models and Medical Knowledge Grounding for Diagnosis Prediction