Abstract:Background: The diagnostic accuracy of differential diagnoses generated by artificial intelligence chatbots, including ChatGPT models, for complex clinical vignettes derived from general internal medicine (GIM) department case reports is unknown. Objective: This study aims to evaluate the accuracy of the differential diagnosis lists generated by both third-generation ChatGPT (ChatGPT-3.5) and fourth-generation ChatGPT (ChatGPT-4) by using case vignettes from case reports published by the Department of GIM of Dokkyo Medical University Hospital, Japan. Methods: We searched PubMed for case reports. Upon identification, physicians selected diagnostic cases, determined the final diagnosis, and displayed them into clinical vignettes. Physicians typed the determined text with the clinical vignettes in the ChatGPT-3.5 and ChatGPT-4 prompts to generate the top 10 differential diagnoses. The ChatGPT models were not specially trained or further reinforced for this task. Three GIM physicians from other medical institutions created differential diagnosis lists by reading the same clinical vignettes. We measured the rate of correct diagnosis within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and the top diagnosis. Results: In total, 52 case reports were analyzed. The rates of correct diagnosis by ChatGPT-4 within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and top diagnosis were 83% (43/52), 81% (42/52), and 60% (31/52), respectively. The rates of correct diagnosis by ChatGPT-3.5 within the top 10 differential diagnosis lists, top 5 differential diagnosis lists, and top diagnosis were 73% (38/52), 65% (34/52), and 42% (22/52), respectively. The rates of correct diagnosis by ChatGPT-4 were comparable to those by physicians within the top 10 (43/52, 83% vs 39/52, 75%, respectively; P =.47) and within the top 5 (42/52, 81% vs 35/52, 67%, respectively; P =.18) differential diagnosis lists and top diagnosis (31/52, 60% vs 26/52, 50%, respectively; P =.43) although the difference was not significant. The ChatGPT models' diagnostic accuracy did not significantly vary based on open access status or the publication date (before 2011 vs 2022). Conclusions: This study demonstrates the potential diagnostic accuracy of differential diagnosis lists generated using ChatGPT-3.5 and ChatGPT-4 for complex clinical vignettes from case reports published by the GIM department. The rate of correct diagnoses within the top 10 and top 5 differential diagnosis lists generated by ChatGPT-4 exceeds 80%. Although derived from a limited data set of case reports from a single department, our findings highlight the potential utility of ChatGPT-4 as a supplementary tool for physicians, particularly for those affiliated with the GIM department. Further investigations should explore the diagnostic accuracy of ChatGPT by using distinct case materials beyond its training data. Such efforts will provide a comprehensive insight into the role of artificial intelligence in enhancing clinical decision-making.

Accuracy Evaluation of GPT-Assisted Differential Diagnosis in Emergency Department

ChatGPT-Generated Differential Diagnosis Lists for Complex Case–Derived Clinical Vignettes: Diagnostic Accuracy Evaluation

Evaluating ChatGPT-4's Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases

Diagnostic Performance of ChatGPT to perform emergency department triage: A systematic review and meta-analysis

Diagnostic Accuracy of ChatGPT for Patients' Triage; a Systematic Review and Meta-Analysis

Assessing the precision of artificial intelligence in emergency department triage decisions: Insights from a study with ChatGPT

Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians

Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports

Are Different Versions of ChatGPT's Ability Comparable to the Clinical Diagnosis Presented in Case Reports? A Descriptive Study

Emergency department triaging using ChatGPT based on emergency severity index principles: a cross-sectional study

AI in the ED: Assessing the efficacy of GPT models vs. physicians in medical score calculation

A Clinical Evaluation of Cardiovascular Emergencies: A Comparison of Responses from ChatGPT, Emergency Physicians, and Cardiologists

ChatGPT-4 Assistance in Optimizing Emergency Department Radiology Referrals and Imaging Selection

Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings

Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties

Evaluation of the prediagnosis and management of ChatGPT-4.0 in clinical cases in cardiology

Exploring ChatGPT's potential in ECG interpretation and outcome prediction in emergency department

The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis

Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination

P467 Towards AI-Augmented Clinical Decision Making: An Examination of ChatGPT's Utility in Acute Ulcerative Colitis Presentations

Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4